'predict' gives different results than using manually the coefficients from 'summary' - r

Let me state my confusion with the help of an example,
#making datasets
x1<-iris[,1]
x2<-iris[,2]
x3<-iris[,3]
x4<-iris[,4]
dat<-data.frame(x1,x2,x3)
dat2<-dat[1:120,]
dat3<-dat[121:150,]
#Using a linear model to fit x4 using x1, x2 and x3 where training set is first 120 obs.
model<-lm(x4[1:120]~x1[1:120]+x2[1:120]+x3[1:120])
#Usig the coefficients' value from summary(model), prediction is done for next 30 obs.
-.17947-.18538*x1[121:150]+.18243*x2[121:150]+.49998*x3[121:150]
#Same prediction is done using the function "predict"
predict(model,dat3)
My confusion is: the two outcomes of predicting the last 30 values differ, may be to a little extent, but they do differ. Whys is it so? should not they be exactly same?

The difference is really small, and I think is just due to the accuracy of the coefficients you are using (e.g. the real value of the intercept is -0.17947075338464965610... not simply -.17947).
In fact, if you take the coefficients value and apply the formula, the result is equal to predict:
intercept <- model$coefficients[1]
x1Coeff <- model$coefficients[2]
x2Coeff <- model$coefficients[3]
x3Coeff <- model$coefficients[4]
intercept + x1Coeff*x1[121:150] + x2Coeff*x2[121:150] + x3Coeff*x3[121:150]

You can clean your code a bit. To create your training and test datasets you can use the following code:
# create training and test datasets
train.df <- iris[1:120, 1:4]
test.df <- iris[-(1:120), 1:4]
# fit a linear model to predict Petal.Width using all predictors
fit <- lm(Petal.Width ~ ., data = train.df)
summary(fit)
# predict Petal.Width in test test using the linear model
predictions <- predict(fit, test.df)
# create a function mse() to calculate the Mean Squared Error
mse <- function(predictions, obs) {
sum((obs - predictions) ^ 2) / length(predictions)
}
# measure the quality of fit
mse(predictions, test.df$Petal.Width)
The reason why your predictions differ is because the function predict() is using all decimal points whereas on your "manual" calculations you are using only five decimal points. The summary() function doesn't display the complete value of your coefficients but approximate the to five decimal points to make the output more readable.

Related

Quasi-Poisson mixed-effect model on overdispersed count data from multiple imputed datasets in R

I'm dealing with problems of three parts that I can solve separately, but now I need to solve them together:
extremely skewed, over-dispersed dependent count variable (the number of incidents while doing something),
necessity to include random effects,
lots of missing values -> multiple imputation -> 10 imputed datasets.
To solve the first two parts, I chose a quasi-Poisson mixed-effect model. Since stats::glm isn't able to include random effects properly (or I haven't figured it out) and lme4::glmer doesn't support the quasi-families, I worked with glmer(family = "poisson") and then adjusted the std. errors, z statistics and p-values as recommended here and discussed here. So I basically turn Poisson mixed-effect regression into quasi-Poisson mixed-effect regression "by hand".
This is all good with one dataset. But I have 10 of them.
I roughly understand the procedure of analyzing multiple imputed datasets – 1. imputation, 2. model fitting, 3. pooling results (I'm using mice library). I can do these steps for a Poisson regression but not for a quasi-Poisson mixed-effect regression. Is it even possible to A) pool across models based on a quasi-distribution, B) get residuals from a pooled object (class "mipo")? I'm not sure. Also I'm not sure how to understand the pooled results for mixed models (I miss random effects in the pooled output; although I've found this page which I'm currently trying to go through).
Can I get some help, please? Any suggestions on how to complete the analysis (addressing all three issues above) would be highly appreciated.
Example of data is here (repre_d_v1 and repre_all_data are stored in there) and below is a crucial part of my code.
library(dplyr); library(tidyr); library(tidyverse); library(lme4); library(broom.mixed); library(mice)
# please download "qP_data.RData" from the last link above and load them
## ===========================================================================================
# quasi-Poisson mixed model from single data set (this is OK)
# first run Poisson regression on df "repre_d_v1", then turn it into quasi-Poisson
modelSingle = glmer(Y ~ Gender + Age + Xi + Age:Xi + (1|Country) + (1|Participant_ID),
family = "poisson",
data = repre_d_v1)
# I know there are some warnings but it's because I share only a modified subset of data with you (:
printCoefmat(coef(summary(modelSingle))) # unadjusted coefficient table
# define quasi-likelihood adjustment function
quasi_table = function(model, ctab = coef(summary(model))) {
phi = sum(residuals(model, type = "pearson")^2) / df.residual(model)
qctab = within(as.data.frame(ctab),
{`Std. Error` = `Std. Error`*sqrt(phi)
`z value` = Estimate/`Std. Error`
`Pr(>|z|)` = 2*pnorm(abs(`z value`), lower.tail = FALSE)
})
return(qctab)
}
printCoefmat(quasi_table(modelSingle)) # done, makes sense
## ===========================================================================================
# now let's work with more than one data set
# object "repre_all_data" of class "mids" contains 10 imputed data sets
# fit model using with() function, then pool()
modelMultiple = with(data = repre_all_data,
expr = glmer(Y ~ Gender + Age + Xi + Age:Xi + (1|Country) + (1|Participant_ID),
family = "poisson"))
summary(pool(modelMultiple)) # class "mipo" ("mipo.summary")
# this has quite similar structure as coef(summary(someGLM))
# but I don't see where are the random effects?
# and more importantly, I wanted a quasi-Poisson model, not just Poisson model...
# ...but here it is not possible to use quasi_table function (defined earlier)...
# ...and that's because I can't compute "phi"
This seems reasonable, with the caveat that I'm only thinking about the computation, not whether this makes statistical sense. What I'm doing here is computing the dispersion for each of the individual fits and then applying it to the summary table, using a variant of the machinery that you posted above.
## compute dispersion values
phivec <- vapply(modelMultiple$analyses,
function(model) sum(residuals(model, type = "pearson")^2) / df.residual(model),
FUN.VALUE = numeric(1))
phi_mean <- mean(phivec)
ss <- summary(pool(modelMultiple)) # class "mipo" ("mipo.summary")
## adjust
qctab <- within(as.data.frame(ss),
{ std.error <- std.error*sqrt(phi_mean)
statistic <- estimate/std.error
p.value <- 2*pnorm(abs(statistic), lower.tail = FALSE)
})
The results look weird (dispersion < 1, all model results identical), but I'm assuming that's because you gave us a weird subset as a reproducible example ...

Regression with weights: Less standardized residuals then observations

I modelled a multiple Regression based on the Mincer-Wage-Equation and I added a weighting-factor to make it representative for the whole population.
But when I'm adding the weights function into my modell, R calculates less standardized residuals than I have observations.
Here's my modell:
lm(log(earings) ~ Gender + Age + Age^2 + Education, weights= phrf)
So I got problems to analyze the residuals because when I'm trying to plot the rstandard against the fitted.values R is telling: Different Variable Length in rstandard() found.
This Problem ist only by rstandard and rstudent, when I'm plotting the normal resid() against fitted.values there is no problem.
And when I'm leaving out the weights function I have not problems, too.
In the help file for rstudent():
Note that cases with weights == 0 are dropped from all these functions, but that if a linear model has been fitted with na.action = na.exclude, suitable values are filled in for the cases excluded during fitting.
A simple example to demonstrate:
set.seed(123)
x <- 1:100
y <- x + rnorm(100)
w <- runif(100)
w[44] <- 0
fit <- lm(y ~ x, weights=w)
length(fitted(fit))
length(rstudent(fit))
Gives:
> length(fitted(fit))
[1] 100
> length(rstudent(fit))
[1] 99
And this makes sense. If you have a weight of 0, the theoretical variance is 0 which is an infinite studentized or standardized residual.
Since you are effectively deleting those observations, you can subset the call to lm with subset=w!=0 or you can use that flag for the fitted values:
plot(fitted(fit)[w!=0], rstudent(fit))

How to plot glm model coefficients with abline in R?

I'm struggling to plot the cofficients of an glm model using abline. Lets take this simple 2D example:
d <- iris[51:150, c(3:4,5)]
d[,3] <- factor(d[,3])
plot(d[,1:2], col=d[,3])
The glm model yields 4 coefficients:
m <- glm(formula = Species~Petal.Length*Petal.Width, data = d, family = "binomial")
m$coefficients
# (Intercept) Petal.Length Petal.Width Petal.Length:Petal.Width
# -131.23813 22.93553 63.63527 -10.63606
How to plot those with a simple abline?
Binomial models are usually not set up like this. You usually will have a single 0|1 response variable (i.e. predict whether a sample in a single species). Maybe because you only have 2 species included in your model, it still seems to work (this is not that case when all 3 spp are included).
The second trick is to predict type="response" and round these values to get discrete predictions:
d$pred <- factor(levels(d[,3])[round(predict(m, type="response"))+1])
plot(d[,1:2], col=d[,3])
points(d[,1:2], col=d$pred, pch=4)
here I've added an "X" for the predictions. If color is the same, then the prediction was correct. I count 5 samples where the prediction was incorrect.

Have others seen a discontinuity in e1071::predict.svm() probabilities?

I have been using e1071::svm(...,probability=TRUE) in R to fit a binary SVM classifier and then using predict.svm() to get the probabilities for both the training sample and a test sample. When I converted the probabilities to log(odds) and plotted them against the decision.values, I found there was a discontinuity in the predictions:
Plot of log(odds) = log(prob/(1-prob)) vs. Decision Values
This is happening with other models as well, whenever the probability is below about 0.25%; there is consistently a gap from log(odds)= -5.98 to -10.86. Note that this does not occur at a fixed decision.value (which varies with the model). I believe it may also be happening at high probabilities (>99%) as well.
The red and green lines are linear fits for the predictions with log(odds)<-8 and >-8, respectively. The coefficients of the latter agree with the probA and probB outputs returned with the svm object. I have seen other cases where the gap occurs from from +5.98 to +10.86 (only).
Here is the example using the iris dataset:
require("datasets")
require("e1071")
iris$is.setosa <- as.numeric(iris$Species=="setosa")
set.seed(8675309)
fit <- svm(
is.setosa ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
data=iris,probability=T,cost=0.01,kernel="linear",type="C-classification")
preds <- predict(fit,prob=TRUE,newdata=iris,decision=T)
DVs <- attr(preds,"decision.values")[,1]
probs <- attr(preds,"probabilities")[,"1"]
logodds <- log(probs/(1-probs))
plot(DVs,logodds,xlab="decision.values",ylab="log(odds)",main="IRIS dataset")
cat("Coefficents of probability model reported by svm():\n")
print(fit[c("probA","probB")])
fit <- lm(logodds ~ DVs,subset=which(logodds> -8))
cat("fit of logodds ~ DVs when log(odds) greater than -8:\n")
print(summary(fit))
abline(fit,col="green",lty=3)
fit <- lm(logodds ~ DVs,subset=which(logodds< -8))
abline(fit,col="red",lty=3)
Has anyone else seen this behavior? Any idea what might be causing it? Thanks!

Generating predictive simulations from a multilevel model with random intercepts

I am struggling to understand how, in R, to generate predictive simulations for new data using a multilevel linear regression model with a single set of random intercepts. Following the example on pp. 146-147 of this text, I can execute this task for a simple linear model with no random effects. What I can't wrap my head around is how to extend the set-up to accommodate random intercepts for a factor added to that model.
I'll use iris and some fake data to show where I'm getting stuck. I'll start with a simple linear model:
mod0 <- lm(Sepal.Length ~ Sepal.Width, data = iris)
Now let's use that model to generate 1,000 predictive simulations for 250 new cases. I'll start by making up those cases:
set.seed(20912)
fakeiris <- data.frame(Sepal.Length = rnorm(250, mean(iris$Sepal.Length), sd(iris$Sepal.Length)),
Sepal.Width = rnorm(250, mean(iris$Sepal.Length), sd(iris$Sepal.Length)),
Species = sample(as.character(unique(iris$Species)), 250, replace = TRUE),
stringsAsFactors=FALSE)
Following the example in the aforementioned text, here's what I do to get 1,000 predictive simulations for each of those 250 new cases:
library(arm)
n.sims = 1000 # set number of simulations
n.tilde = nrow(fakeiris) # set number of cases to simulate
X.tilde <- cbind(rep(1, n.tilde), fakeiris[,"Sepal.Width"]) # create matrix of predictors describing those cases; need column of 1s to multiply by intercept
sim.fakeiris <- sim(mod0, n.sims) # draw the simulated coefficients
y.tilde <- array(NA, c(n.sims, n.tilde)) # build an array to hold results
for (s in 1:n.sims) { y.tilde[s,] <- rnorm(n.tilde, X.tilde %*% sim.fakeiris#coef[s,], sim.fakeiris#sigma[s]) } # use matrix multiplication to fill that array
That works fine, and now we can do things like colMeans(y.tilde) to inspect the central tendencies of those simulations, and cor(colMeans(y.tilde), fakeiris$Sepal.Length) to compare them to the (fake) observed values of Sepal.Length.
Now let's try an extension of that simple model in which we assume that the intercept varies across groups of observations --- here, species. I'll use lmer() from the lme4 package to estimate a simple multilevel/hierarchical model that matches that description:
library(lme4)
mod1 <- lmer(Sepal.Length ~ Sepal.Width + (1 | Species), data = iris)
Okay, that works, but now what? I run:
sim.fakeiris.lmer <- sim(mod1, n.sims)
When I use str() to inspect the result, I see that it is an object of class sim.merMod with three components:
#fixedef, a 1,000 x 2 matrix with simulated coefficients for the fixed effects (the intercept and Sepal.Width)
#ranef, a 1,000 x 3 matrix with simulated coefficients for the random effects (the three species)
#sigma, a vector of length 1,000 containing the sigmas associated with each of those simulations
I can't wrap my head around how to extend the matrix construction and multiplication used for the simple linear model to this situation, which adds another dimension. I looked in the text, but I could only find an example (pp. 272-275) for a single case in a single group (here, species). The real-world task I'm aiming to perform involves running simulations like these for 256 new cases (pro football games) evenly distributed across 32 groups (home teams). I'd greatly appreciate any assistance you can offer.
Addendum. Stupidly, I hadn't looked at the details on simulate.merMod() in lme4 before posting this. I have now. It seems like it should do the trick, but when I run simulate(mod0, nsim = 1000, newdata = fakeiris), the result has only 150 rows. The values look sensible, but there are 250 rows (cases) in fakeiris. Where is that 150 coming from?
One possibility is to use the predictInterval function from the merTools package. The package is about to be submitted to CRAN, but the current developmental release is available for download from GitHub,
install.packages("devtools")
devtools::install_github("jknowles/merTools")
To get the median and a 95% credible interval of 100 simulations:
mod1 <- lmer(Sepal.Length ~ Sepal.Width + (1 | Species), data = iris)
out <- predictInterval(mod1, newdata=fakeiris, level=0.95,
n.sims=100, stat="median")
By default, predictInterval includes the residual variation, but you can
turn that feature off with:
out2 <- predictInterval(mod1, newdata=fakeiris, level=0.95,
n.sims=100, stat="median",
include.resid.var=FALSE)
Hope this helps!
This might help: it doesn't use sim(), but instead uses mvrnorm() to draw the new coefficients from the sampling distribution of the fixed-effect parameters, uses a bit of internal machinery (setBeta0) to reassign the internal values of the fixed-effect coefficients. The internal values of the random effect coefficients are automatically resampled by simulate.merMod using the default argument re.form=NA. However, the residual variance is not resampled -- it is held fixed across the simulations, which isn't 100% realistic.
In your use case, you would specify newdata=fakeiris.
library(lme4)
mod1 <- lmer(Sepal.Length ~ Sepal.Width + (1 | Species), data = iris)
simfun <- function(object,n=1,newdata=NULL,...) {
v <- vcov(object)
b <- fixef(object)
betapars <- MASS::mvrnorm(n,mu=b,Sigma=v)
npred <- if (is.null(newdata)) {
length(predict(object))
} else nrow(newdata)
res <- matrix(NA,npred,n)
for (i in 1:n) {
mod1#pp$setBeta0(betapars[i,])
res[,i] <- simulate(mod1,newdata=newdata,...)[[1]]
}
return(res)
}
ss <- simfun(mod1,100)

Resources