regression line and confidence interval in R: GLMM with several fixed effects - r

As a follow-up to the question "Creating confidence intervals for regression curve in GLMM using Bootstrapping", I am interested in getting the correct values of a regression curve and its associated confidence interval curves.
Consider a case where in a GLMM, there is one response variable, two continuous fixed effects and one random effect. Here is some fake data:
library (dplyr)
set.seed (1129)
x1 <- runif(100,0,1)
x2 <- rnorm(100,0.5,0.4)
f1 <- gl(n = 5,k = 20)
rnd1<-rnorm(5,0.5,0.1)
my_data <- data.frame(x1=x1, x2=x2, f1=f1)
modmat <- model.matrix(~x1+x2, my_data)
fixed <- c(-0.12,0.35,0.09)
y <- (modmat%*%fixed+rnd1)
my_data$y <- ((y - min (y))/max(y- min (y))) %>% round (digits = 1)
rm (y)
The GLMM that I fit looks like this:
library(lme4)
m1 <- glmer(y ~ x1 + x2 + (1 | f1), my_data, family = "binomial")
summary (m1)
Generalized linear mixed model fit by maximum likelihood (Laplace Approximation) ['glmerMod']
Family: binomial ( logit )
Formula: y ~ x1 + x2 + (1 | f1)
Data: my_data
AIC BIC logLik deviance df.resid
65.7 76.1 -28.8 57.7 96
Scaled residuals:
Min 1Q Median 3Q Max
-8.4750 -0.7042 -0.0102 1.5904 14.5919
Random effects:
Groups Name Variance Std.Dev.
f1 (Intercept) 1.996e-10 1.413e-05
Number of obs: 100, groups: f1, 5
Fixed effects:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -9.668 2.051 -4.713 2.44e-06 ***
x1 12.855 2.659 4.835 1.33e-06 ***
x2 4.875 1.278 3.816 0.000136 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Correlation of Fixed Effects:
(Intr) x1
x1 -0.970
x2 -0.836 0.734
convergence code: 0
boundary (singular) fit: see ?isSingular
Plotting y vs x1:
plot (y~x1, my_data)
It should be possible to get a regression curve from the summary of m1. I have learned that I need to reverse the link-function (in this case, "logit"):
y = 1/(1+exp(-(Intercept+b*x1+c*x2)))
In order to plot a regression curve of x1 in a two-dimensional space, I set x2 = mean(x2) in the formula (which also seems important - the red line in the following plots ignores x2, apparently leading to considerable bias). The regression line:
xx <- seq (from = 0, to = 1, length.out = 100)
yy <- 1/(1+exp(-(-9.668+12.855*xx+4.875*mean(x2))))
yyy <- 1/(1+exp(-(-9.668+12.855*xx)))
lines (yy ~ xx, col = "blue")
lines (yyy~ xx, col = "red")
I think, the blue line looks not so good (and the red line worse, of course). So as a side-question: is y = 1/(1+exp(-(Intercept+b*x1+c*x2))) always the right choice as a back-transformation of the logit-link? I am asking because I found this https://sebastiansauer.github.io/convert_logit2prob/, which made me suspicious. Or is there another reason for the model not to fit so well? Maybe my data creation process is somewhat 'bad'.
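On the side question, a minimal base-R sketch may help: the formula 1/(1+exp(-eta)) is the inverse of the logit link, and R exposes it as plogis() (with qlogis() as the logit itself). The coefficient vector below is copied from the summary above, while the value 0.5 used for x2 is only a placeholder assumption standing in for mean(x2).

```r
# The inverse logit written by hand and via the built-in plogis()
eta <- seq(-5, 5, by = 0.5)
p_manual <- 1 / (1 + exp(-eta))
p_plogis <- plogis(eta)
same_inverse <- isTRUE(all.equal(p_manual, p_plogis))

# qlogis() is the logit link, so it undoes plogis()
round_trip <- isTRUE(all.equal(qlogis(p_plogis), eta))

# Population-level curve for x1 with x2 held fixed, built from a design matrix
b <- c(-9.668, 12.855, 4.875)            # fixed effects from summary(m1)
newdat <- data.frame(x1 = seq(0, 1, length.out = 100),
                     x2 = 0.5)           # placeholder for mean(x2)
X <- model.matrix(~ x1 + x2, newdat)
yy <- plogis(drop(X %*% b))              # predicted probabilities on the response scale
```

Building the linear predictor from a design matrix keeps the column order tied to the formula, which is less error-prone than typing the coefficients into an expression by hand.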
What I need now is to add the 95% confidence interval to the curve. I think that bootstrapping with the bootMer function should be a good approach. However, all the examples I found were for models with a single fixed effect. @Jamie Murphy asked a similar question, but he was interested in models with one continuous and one categorical fixed effect: Creating confidence intervals for regression curve in GLMM using Bootstrapping
But when it comes to models with more than one continuous variable as fixed effects, I get lost. Perhaps someone can help solve this issue, possibly with a modification of the second part of this tutorial:
https://www.r-bloggers.com/2015/06/confidence-intervals-for-prediction-in-glmms/
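One possible way to set this up is sketched below, under stated assumptions. It uses freshly simulated 0/1 data (not the data above), a prediction grid in which x1 varies and x2 is held at its mean, and lme4's bootMer() with a function that returns population-level predictions (re.form = NA). The names sim, newdat and pred_fun, the number of groups, and nsim = 200 are illustrative choices only.

```r
library(lme4)

# Simulated binary data with two continuous fixed effects and a random intercept
set.seed(42)
n_groups <- 10
n_per    <- 30
sim <- data.frame(
  x1 = runif(n_groups * n_per),
  x2 = rnorm(n_groups * n_per, 0.5, 0.4),
  f1 = gl(n_groups, n_per)
)
u   <- rnorm(n_groups, 0, 0.5)                      # group-level intercept shifts
eta <- -1.5 + 2 * sim$x1 + 1 * sim$x2 + u[as.integer(sim$f1)]
sim$y <- rbinom(nrow(sim), 1, plogis(eta))

fit <- glmer(y ~ x1 + x2 + (1 | f1), data = sim, family = binomial)

# Prediction grid: vary x1, hold x2 at its mean
newdat <- data.frame(x1 = seq(0, 1, length.out = 50), x2 = mean(sim$x2))

# Population-level predictions on the probability scale
pred_fun <- function(m) predict(m, newdata = newdat, re.form = NA, type = "response")

# Parametric bootstrap of the whole curve, then pointwise percentile intervals
boot <- bootMer(fit, FUN = pred_fun, nsim = 200, seed = 1)
ci <- apply(boot$t, 2, quantile, probs = c(0.025, 0.975), na.rm = TRUE)

plot(y ~ x1, data = sim)
lines(newdat$x1, pred_fun(fit), col = "blue")
lines(newdat$x1, ci[1, ], lty = 2)
lines(newdat$x1, ci[2, ], lty = 2)
```

The same pattern extends to further continuous covariates: add them to newdat at whatever fixed values are of interest. The band describes uncertainty in the curve at those values, not across the whole covariate distribution.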

Related

Logistic regression parameter P-value changes after logarithm - R

I have an issue when calculating logistic regression in R that, to me, makes no sense.
I have one parameter in the model, positive numbers (molecular weight).
I have a binary response variable, let's say either A or B.
My data table is called df1.
str(df1)
'data.frame': 1015 obs. of 2 variables:
$ Protein_Class: chr "A" "A" "A" "B" ...
$ MW : num 47114 29586 26665 34284 104297 ...
I make the model:
summary(glm(as.factor(df1[,1]) ~ df1[,2],family="binomial"))
The results are:
Call:
glm(formula = as.factor(df1[, 1]) ~ df1[, 2], family = "binomial")
Deviance Residuals:
Min 1Q Median 3Q Max
-1.5556 -1.5516 0.8430 0.8439 0.8507
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 8.562e-01 1.251e-01 6.842 7.8e-12 ***
df1[, 2] -1.903e-07 3.044e-06 -0.063 0.95
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1239.2 on 1014 degrees of freedom
Residual deviance: 1239.2 on 1013 degrees of freedom
AIC: 1243.2
Number of Fisher Scoring iterations: 4
That's all fine and good until this point.
But, when I take the logarithm of my variable:
summary(glm(as.factor(df1[,1]) ~ log10(df1[,2]),family="binomial"))
Call:
glm(formula = as.factor(df1[, 1]) ~ log10(df1[, 2]), family = "binomial")
Deviance Residuals:
Min 1Q Median 3Q Max
-1.8948 -1.4261 0.8007 0.8528 1.0469
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.7235 1.1169 -2.438 0.01475 *
log10(df1[, 2]) 0.8038 0.2514 3.197 0.00139 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1239.2 on 1014 degrees of freedom
Residual deviance: 1228.9 on 1013 degrees of freedom
AIC: 1232.9
Number of Fisher Scoring iterations: 4
The p-value has changed!
How can this be? And more importantly, which one to use?
My understanding was that logistic regression is based on ranks, and all I do is a monotone transformation. Note, that the AUROC curve of the model remains the same.
There are no zero or negative values that are lost during the transformation.
Did I miss something here?
Any advice?
Thanks in advance,
Adam
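The observation in the question can be reproduced with a small base-R sketch on simulated data (the data, the seed and the helper auc() are assumptions made for illustration). A rank-based quantity such as the AUROC is unchanged by a monotone transformation of the predictor, but the logistic likelihood is not rank-based: it assumes the log-odds are linear in whatever scale the predictor is entered on, so the fit and the p-value can change.

```r
set.seed(1)
x <- exp(rnorm(500, mean = 10, sd = 1))           # positive, right-skewed predictor
y <- rbinom(500, 1, plogis(-8 + 0.8 * log(x)))    # log-odds linear in log(x)

fit_raw <- glm(y ~ x, family = binomial)
fit_log <- glm(y ~ log10(x), family = binomial)

# p-values for the slope under each parameterisation
p_raw <- coef(summary(fit_raw))[2, 4]
p_log <- coef(summary(fit_log))[2, 4]

# AUC computed from ranks (Mann-Whitney form); depends only on the ordering of the scores
auc <- function(score, y) {
  r  <- rank(score)
  n1 <- sum(y == 1)
  n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}
auc_raw <- auc(x, y)
auc_log <- auc(log10(x), y)

c(p_raw = p_raw, p_log = p_log)
c(auc_raw = auc_raw, auc_log = auc_log)
```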
There are a couple of things to think about. First, you can probably constrain your search to one side or the other of a power of 1. Transformations that decrease the power on x (square root, log, inverse, etc.) all have a similar type of effect, but to differing degrees: they pull in big values and spread out small values. Transformations with a power greater than 1 do the opposite; they tend to increase the spread among big values and decrease the spread among small values. All of this assumes you've got no non-positive values in your variable. This is really, then, a question about what kind of transformation you want and, after that, how severe it has to be.
First, what kind of transformation do you need. I made some fake data to illustrate the point:
library(dplyr)
library(tidyr)
library(ggplot2)
set.seed(1234)
x <- runif(1000, 1, 10000)
y.star <- -6 + log(x)
y <- rbinom(1000, 1, plogis(y.star) )
df <- tibble(
y=y,
x=x,
ystar=y.star)
Next, since this is just a bivariate relationship, we could plot it out with a loess curve. In particular, though, we want to know what the log-odds of y look like with respect to x. We can do this by transforming the predictions from the loess curve with the logistic quantile function, qlogis() - this takes the probabilities and puts them in log-odds form. Then, we could make the plot.
lo <- loess(y ~ x, span=.75)
df <- df %>% mutate(fit = predict(lo),
fit = case_when(
fit < .01 ~ .01,
fit > .99 ~ .99,
TRUE ~ fit))
ggplot(df) +
geom_line(aes(x=x, y=qlogis(fit)))
This looks like a classic log relationship. We could then implement a few different transformations and plot those - square root, log and negative inverse.
lo1 <- loess(y ~ sqrt(x), span=.5)
lo2 <- loess(y ~ log(x), span=.5)
lo3 <- loess(y ~ I(-(1/x)), span=.5)
df <- df %>% mutate(fit1 = predict(lo1),
fit1 = case_when(
fit1 < .01 ~ .01,
fit1 > .99 ~ .99,
TRUE ~ fit1))
df <- df %>% mutate(fit2 = predict(lo2),
fit2 = case_when(
fit2 < .01 ~ .01,
fit2 > .99 ~ .99,
TRUE ~ fit2))
df <- df %>% mutate(fit3 = predict(lo3),
fit3 = case_when(
fit3 < .01 ~ .01,
fit3 > .99 ~ .99,
TRUE ~ fit3))
Next, we need to transform the data so the plotting will look right:
plot.df <- df %>%
tidyr::pivot_longer(cols=starts_with("fit"),
names_to="var",
values_to="vals") %>%
mutate(x2 = case_when(
var == "fit" ~ x,
var == "fit1" ~ sqrt(x),
var == "fit2" ~ log(x),
var == "fit3" ~ -(1/x),
TRUE ~ x),
var = factor(var, labels=c("Original", "Square Root", "Log", "Inverse")))
Then, we can make the plot:
ggplot(plot.df, aes(x=x2, y=vals)) +
geom_line() +
facet_wrap(~var, scales="free_x")
Here, it looks like the log is the most linear of the bunch - not surprising since we made the variable y.star with log(x). If we wanted to test between these different possibilities, Kevin Clarke, a Political Scientist at Rochester proposed a paired sign test for evaluating the difference between non-nested models. There is a paper about it here. I wrote a package called clarkeTest that implements this in R. So, we could use this to test the various different alternatives:
m0 <- glm(y ~ x, data=df, family=binomial)
m1 <- glm(y ~ sqrt(x), data=df, family=binomial)
m2 <- glm(y ~ log(x), data=df, family=binomial)
m3 <- glm(y ~ I(-(1/x)), data=df, family=binomial)
Testing the original against the square root:
library(clarkeTest)
clarke_test(m0, m1)
#
# Clarke test for non-nested models
#
# Model 1 log-likelihood: -309
# Model 2 log-likelihood: -296
# Observations: 1000
# Test statistic: 400 (40%)
#
# Model 2 is preferred (p = 2.7e-10)
This shows that the square root is better than the original un-transformed variable.
clarke_test(m0, m2)
#
# Clarke test for non-nested models
#
# Model 1 log-likelihood: -309
# Model 2 log-likelihood: -284
# Observations: 1000
# Test statistic: 462 (46%)
#
# Model 2 is preferred (p = 0.018)
The above shows that the log is better than the un-transformed variable.
clarke_test(m0, m3)
#
# Clarke test for non-nested models
#
# Model 1 log-likelihood: -309
# Model 2 log-likelihood: -292
# Observations: 1000
# Test statistic: 550 (55%)
#
# Model 1 is preferred (p = 0.0017)
The above shows that the un-transformed variable is preferred to the negative inverse. Then, we can test the difference of the two models preferred to the original.
clarke_test(m1, m2)
#
# Clarke test for non-nested models
#
# Model 1 log-likelihood: -296
# Model 2 log-likelihood: -284
# Observations: 1000
# Test statistic: 536 (54%)
#
# Model 1 is preferred (p = 0.025)
This shows that the square root is better than the log transformation in terms of individual log-likelihoods.
Another option would be a grid search over possible transformations and look at the AIC each time. We first have to make a function to deal with the situation where the transformation power = 0, where we should substitute the log. Then we can run a model for each different transformation and get the AICs.
grid <- seq(-1,1, by=.1)
trans <- function(x, power){
if(power == 0){
tx <- log(x)
}else{
tx <- x^power
}
tx
}
mods <- lapply(grid, function(p)glm(y ~ trans(x, p),
data=df,
family=binomial))
aic.df <- tibble(
power = grid,
aic = sapply(mods, AIC))
Next, we can plot the AICs as a function of the power.
ggplot(aic.df, aes(x=power, y=aic)) +
geom_line()
This tells us that about -.25 is the appropriate transformation parameter. Note that there is a discrepancy between the Clarke test results and the AIC because AIC is based on the overall log-likelihood and the Clarke test is based on differences in the individual log-likelihoods.
We would find that this new proposed transformation is also worse than the square root:
m4 <- glm(y ~ I(x^-.25), data=df, family=binomial)
clarke_test(m1, m4)
#
# Clarke test for non-nested models
#
# Model 1 log-likelihood: -296
# Model 2 log-likelihood: -283
# Observations: 1000
# Test statistic: 559 (56%)
#
# Model 1 is preferred (p = 0.00021)
So, if you have a couple of different candidates in mind and you like the idea behind the Clarke test, you could use that to find the appropriate transformation. If you don't have a candidate in mind, a grid search is always a possibility.
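The discrepancy between AIC and the Clarke test noted above can be made concrete with a base-R sketch. It rebuilds the same kind of fake data, computes each observation's log-likelihood contribution for two models, and then compares their sum (what AIC is built on) with the count of observations favouring one model (the idea behind the paired sign test). This is only an illustration of the idea, not the clarkeTest implementation, which may apply additional corrections; both models here have the same number of parameters.

```r
set.seed(1234)
x <- runif(1000, 1, 10000)
y <- rbinom(1000, 1, plogis(-6 + log(x)))

m_sqrt <- glm(y ~ sqrt(x), family = binomial)
m_log  <- glm(y ~ log(x),  family = binomial)

# Per-observation log-likelihood contributions for a 0/1 response
ll_sqrt <- dbinom(y, 1, fitted(m_sqrt), log = TRUE)
ll_log  <- dbinom(y, 1, fitted(m_log),  log = TRUE)

# Totals: the quantity AIC is based on
total_ll <- c(sqrt = sum(ll_sqrt), log = sum(ll_log))

# Sign-test view: for how many observations does the sqrt model fit better?
d <- ll_sqrt - ll_log
n_sqrt_better <- sum(d > 0)
sign_p <- binom.test(n_sqrt_better, length(d), p = 0.5)$p.value

total_ll
n_sqrt_better
sign_p
```

A model can win on the total while losing on the count when it fits a minority of observations much better and the majority slightly worse, which is why the two criteria need not agree.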

Simulate data for logistic regression with fixed r2

I would like to simulate data for a logistic regression where I can specify its explained variance beforehand. Have a look at the code below. I simulate four independent variables and specify that each logit coefficient should be of size log(2)=0.69. This works nicely, the explained variance (I report Cox & Snell's r2) is 0.34.
However, I need to specify the regression coefficients in such a way that a pre-specified r2 will result from the regression. So if I would like to produce an r2 of let's say exactly 0.1. How do the coefficients need to be specified? I am kind of struggling with this..
# Create independent variables
sigma.1 <- matrix(c(1,0.25,0.25,0.25,
0.25,1,0.25,0.25,
0.25,0.25,1,0.25,
0.25,0.25,0.25,1),nrow=4,ncol=4)
mu.1 <- rep(0,4)
n.obs <- 500000
library(MASS)
sample1 <- as.data.frame(mvrnorm(n = n.obs, mu.1, sigma.1, empirical=FALSE))
# Create latent continuous response variable
sample1$ystar <- 0 + log(2)*sample1$V1 + log(2)*sample1$V2 + log(2)*sample1$V3 + log(2)*sample1$V4
# Construct binary response variable
sample1$prob <- exp(sample1$ystar) / (1 + exp(sample1$ystar))
sample1$y <- rbinom(n.obs,size=1,prob=sample1$prob)
# Logistic regression
logreg <- glm(y ~ V1 + V2 + V3 + V4, data=sample1, family=binomial)
summary(logreg)
The output is:
Call:
glm(formula = y ~ V1 + V2 + V3 + V4, family = binomial, data = sample1)
Deviance Residuals:
Min 1Q Median 3Q Max
-3.7536 -0.7795 -0.0755 0.7813 3.3382
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.002098 0.003544 -0.592 0.554
V1 0.691034 0.004089 169.014 <2e-16 ***
V2 0.694052 0.004088 169.776 <2e-16 ***
V3 0.693222 0.004079 169.940 <2e-16 ***
V4 0.699091 0.004081 171.310 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 693146 on 499999 degrees of freedom
Residual deviance: 482506 on 499995 degrees of freedom
AIC: 482516
Number of Fisher Scoring iterations: 5
And Cox and Snell's r2 gives:
library(pscl)
pR2(logreg)["r2ML"]
r2ML
0.3436523
If you add a random error term to the ystar variable, making ystar.r, and then work with that, you can tweak the standard deviation until it meets your specifications.
sample1$ystar.r <- sample1$ystar+rnorm(n.obs, 0, 3.8) # tried a few values
sample1$prob <- exp(sample1$ystar.r) / (1 + exp(sample1$ystar.r))
sample1$y <- rbinom(n.obs,size=1,prob=sample1$prob)
logreg <- glm(y ~ V1 + V2 + V3 + V4, data=sample1, family=binomial)
summary(logreg) # the estimates "shrink"
pR2(logreg)["r2ML"]
#-------
r2ML
0.1014792
R-squared (and its variations) is a random variable, as it depends on your simulated data. If you simulate data with the exact same parameters multiple times, you'll most likely get different values for R-squared each time. Therefore, you cannot produce a simulation where the R-squared will be exactly 0.1 just by controlling the parameters.
On the other hand, since it's a random variable, you could potentially simulate your data from a conditional distribution (conditioning on a fixed value of R-squared), but you would need to find out what these distributions look like (math might get really ugly here, cross validated is more appropriate for this part).
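A different, hedged sketch of the original goal: instead of adding noise, scale the common coefficient until Cox & Snell's r2 on one fixed simulated sample is close to the target. The sample size, the seed, the helper name cox_snell() and the search interval are all assumptions for illustration. As noted above, r2 is a random variable, so a fresh sample drawn with the coefficient found here will give a slightly different value.

```r
set.seed(1)
n <- 20000
sigma <- matrix(0.25, 4, 4)
diag(sigma) <- 1

# Correlated standard-normal predictors via the Cholesky factor of sigma
X <- matrix(rnorm(n * 4), n, 4) %*% chol(sigma)

# Fixed uniforms so the simulated response changes as little as possible with b
u <- runif(n)

# Cox & Snell r2 when all four logit coefficients equal b
cox_snell <- function(b) {
  p   <- plogis(drop(X %*% rep(b, 4)))
  y   <- as.integer(u < p)
  fit <- glm(y ~ X, family = binomial)
  1 - exp((fit$deviance - fit$null.deviance) / n)
}

target <- 0.1
sol <- uniroot(function(b) cox_snell(b) - target, interval = c(0.01, 1))
b_hat  <- sol$root
r2_hat <- cox_snell(b_hat)

c(b = b_hat, r2 = r2_hat)
```

Reusing the same uniforms for every candidate b is a common-random-numbers device: it removes most of the simulation noise from the root search, so uniroot() sees a nearly monotone function of b.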

How to obtain Poisson's distribution "lambda" from R glm() coefficients

My R-script produces glm() coeffs below.
What is Poisson's lambda, then? It should be ~3.0 since that's what I used to create the distribution.
Call:
glm(formula = h_counts ~ ., family = poisson(link = log), data = pois_ideal_data)
Deviance Residuals:
Min 1Q Median 3Q Max
-22.726 -12.726 -8.624 6.405 18.515
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 8.222532 0.015100 544.53 <2e-16 ***
h_mids -0.363560 0.004393 -82.75 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 11451.0 on 10 degrees of freedom
Residual deviance: 1975.5 on 9 degrees of freedom
AIC: 2059
Number of Fisher Scoring iterations: 5
random_pois = rpois(10000,3)
h=hist(random_pois, breaks = 10)
mean(random_pois) #verifying that the mean is close to 3.
h_mids = h$mids
h_counts = h$counts
pois_ideal_data <- data.frame(h_mids, h_counts)
pois_ideal_model <- glm(h_counts ~ ., pois_ideal_data, family=poisson(link=log))
summary_ideal=summary(pois_ideal_model)
summary_ideal
What are you doing here???!!! You used a glm to fit a distribution???
Well, it is not impossible to do so, but it is done via this:
set.seed(0)
x <- rpois(10000,3)
fit <- glm(x ~ 1, family = poisson())
i.e., we fit data with an intercept-only regression model.
fit$fitted[1]
# 3.005
This is the same as:
mean(x)
# 3.005
It looks like you're trying to do a Poisson fit to aggregated or binned data; that's not what glm does. I took a quick look for canned ways of fitting distributions to binned data but couldn't find one; it looks like earlier versions of the bda package might have offered this, but not now.
At root, what you need to do is set up a negative log-likelihood function that computes -sum((# counts)*log(prob(count|lambda))) and minimize it using optim(); the solution given below using the bbmle package is a little more complex up-front but gives you added benefits like easily computing confidence intervals etc.
Set up data:
set.seed(101)
random_pois <- rpois(10000,3)
tt <- table(random_pois)
dd <- data.frame(counts=unname(c(tt)),
val=as.numeric(names(tt)))
Here I'm using table rather than hist because histograms on discrete data are fussy (having integer cutpoints often makes things confusing because you have to be careful about right- vs left-closure)
Set up density function for binned Poisson data (to work with bbmle's formula interface, the first argument must be called x, and it must have a log argument).
dpoisbin <- function(x,val,lambda,log=FALSE) {
probs <- dpois(val,lambda,log=TRUE)
r <- sum(x*probs)
if (log) r else exp(r)
}
Fit lambda (log link helps prevent numerical problems/warnings from negative lambda values):
library(bbmle)
m1 <- mle2(counts~dpoisbin(val,exp(loglambda)),
data=dd,
start=list(loglambda=0))
all.equal(unname(exp(coef(m1))),mean(random_pois),tol=1e-6) ## TRUE
exp(confint(m1))
## 2.5 % 97.5 %
## 2.972047 3.040009
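For comparison, the plain optim() route mentioned earlier in this answer can be sketched in base R as follows. The function name nll and the starting value are illustrative assumptions; lambda is again optimised on the log scale to keep it positive.

```r
set.seed(101)
random_pois <- rpois(10000, 3)
tt <- table(random_pois)
counts <- as.numeric(tt)            # how many times each value occurred
val    <- as.numeric(names(tt))     # the observed values themselves

# Negative log-likelihood of the tabulated data as a function of log(lambda)
nll <- function(loglambda) {
  -sum(counts * dpois(val, lambda = exp(loglambda), log = TRUE))
}

opt <- optim(par = 0, fn = nll, method = "BFGS")
lambda_hat <- exp(opt$par)

c(lambda_hat = lambda_hat, sample_mean = mean(random_pois))
```

The estimate should agree closely with the sample mean, which is the closed-form maximum likelihood estimate of a Poisson rate.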

model checking and test of overdispersion for glmer

I am testing differences on the number of pollen grains loading on plant stigmas in different habitats and stigma types.
My sample design comprises two habitats, with 10 sites each habitat.
In each site, I have up to 3 stigma types (wet, dry and semidry), and for each stigma type, I have a different number of plant species, with a different number of individuals per plant species (code).
So, I ended up with nested design as follow: habitat/site/stigmatype/stigmaspecies/code
As it is a descriptive study, stigmatype, stigmaspecies and code vary between sites.
My response variable (n) is the number of pollen grains (log10+1) per stigma per plant, averaged because I collected 3 stigmas per plant.
The data don't fit a Poisson distribution because (i) the values are not integers, and (ii) the variance is much higher than the mean (ratio = 911.0756). So, I fitted a negative binomial model.
After model selection, I have:
m4a <- glmer(n ~ habitat*stigmatype + (1|stigmaspecies/code),
family=negative.binomial(2))
> summary(m4a)
Generalized linear mixed model fit by maximum likelihood ['glmerMod']
Family: Negative Binomial(2) ( log )
Formula: n ~ habitat * stigmatype + (1 | stigmaspecies/code)
AIC BIC logLik deviance
993.9713 1030.6079 -487.9856 975.9713
Random effects:
Groups Name Variance Std.Dev.
code:stigmaspecies (Intercept) 1.034e-12 1.017e-06
stigmaspecies (Intercept) 4.144e-02 2.036e-01
Residual 2.515e-01 5.015e-01
Number of obs: 433, groups: code:stigmaspecies, 433; stigmaspecies, 41
Fixed effects:
Estimate Std. Error t value Pr(>|z|)
(Intercept) -0.31641 0.08896 -3.557 0.000375 ***
habitatnon-invaded -0.67714 0.10060 -6.731 1.68e-11 ***
stigmatypesemidry -0.24193 0.15975 -1.514 0.129905
stigmatypewet -0.07195 0.18665 -0.385 0.699885
habitatnon-invaded:stigmatypesemidry 0.60479 0.22310 2.711 0.006712 **
habitatnon-invaded:stigmatypewet 0.16653 0.34119 0.488 0.625491
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Correlation of Fixed Effects:
(Intr) hbttn- stgmtyps stgmtypw hbttnn-nvdd:stgmtyps
hbttnn-nvdd -0.335
stgmtypsmdr -0.557 0.186
stigmatypwt -0.477 0.160 0.265
hbttnn-nvdd:stgmtyps 0.151 -0.451 -0.458 -0.072
hbttnn-nvdd:stgmtypw 0.099 -0.295 -0.055 -0.403 0.133
Two questions:
How do I check for overdispersion from this output?
What is the best way to go through model validation here?
I have been using:
qqnorm(resid(m4a))
hist(resid(m4a))
plot(fitted(m4a),resid(m4a))
qqnorm() and hist() seem OK, but there is a tendency toward heteroscedasticity in the third graph. And here is my final question:
Can I do model validation with this graph in glmer, or is there a better way to do it? If not, how much should I worry about the third graph?
A simple way to check for overdispersion in glmer is:
library("blmeco")
dispersion_glmer(your_model) # it shouldn't be over 1.4
To solve overdispersion I usually add an observation-level random factor.
For model validation I usually start from these plots, but then it depends on your specific model:
par(mfrow=c(2,2))
qqnorm(resid(your_model), main="normal qq-plot, residuals")
qqline(resid(your_model))
qqnorm(ranef(your_model)$id[,1])
qqline(ranef(your_model)$id[,1])
plot(fitted(your_model), resid(your_model)) #residuals vs fitted
abline(h=0)
your_data$fitted <- fitted(your_model) #fitted vs observed
plot(your_data$fitted, jitter(your_data$total,0.1))
abline(0,1)
hope this helps a little....
cheers
Just an addition to Q1 for those who might find this by googling: the blmeco dispersion_glmer function appears to be outdated. It is better to use @Ben_Bolker's function for this purpose:
overdisp_fun <- function(model) {
rdf <- df.residual(model)
rp <- residuals(model,type="pearson")
Pearson.chisq <- sum(rp^2)
prat <- Pearson.chisq/rdf
pval <- pchisq(Pearson.chisq, df=rdf, lower.tail=FALSE)
c(chisq=Pearson.chisq,ratio=prat,rdf=rdf,p=pval)
}
Source: https://bbolker.github.io/mixedmodels-misc/glmmFAQ.html#overdispersion.
With the highlighted notion:
Do PLEASE note the usual, and extra, caveats noted here: this is an APPROXIMATE estimate of an overdispersion parameter.
PS. Why outdated?
The lme4 package includes the residuals function these days, and Pearson residuals are supposedly more robust for this type of calculation than the deviance residuals. blmeco::dispersion_glmer squares the deviance residuals together with the random-effect values u, sums them, divides by the number of residuals and takes the square root (the function):
dispersion_glmer <- function (modelglmer)
{
n <- length(resid(modelglmer))
return(sqrt(sum(c(resid(modelglmer), modelglmer@u)^2)/n))
}
The blmeco solution gives considerably higher deviance/df ratios than Bolker's function. Since Ben is one of the authors of the lme4 package, I would trust his solution more although I am not qualified to rationalize the statistical reason.
x <- InsectSprays
x$id <- rownames(x)
mod <- lme4::glmer(count ~ spray + (1|id), data = x, family = poisson)
blmeco::dispersion_glmer(mod)
# [1] 1.012649
overdisp_fun(mod)
# chisq ratio rdf p
# 55.7160734 0.8571704 65.0000000 0.7873823

Polynomial data and R's glm()

How can you get R's glm() to match polynomial data? I've tried several iterations of 'family=AAA(link="BBB")' but I can't seem to get trivial predictions to match.
For example, please help with R's glm to match polynomial data
x=seq(-6,6,2)
y=x*x
parabola=data.frame(x,y)
plot(parabola)
model=glm(y~x,dat=parabola)
test=data.frame(x=seq(-5,5,2))
test$y=predict(model,test)
plot(test)
The plot(parabola) looks as expected, but I can't find the incantation of glm() that will make plot(test) look parabolic.
I think you need to step back and start to think about a model and how you represent this in R. In your example, y is a quadratic function of x, so you need to include x and x^2 in the model formula, i.e. as predictors you need to estimate the effect of x and x^2 on the response given the data to hand.
If y is Gaussian, conditional upon the model, then you can do this with lm() and either
y ~ x + I(x^2)
or
y ~ poly(x, 2)
In the first, we wrap the quadratic term in I() as the ^ operator has a special meaning (not its mathematical one) in an R model formula. The latter version gives orthogonal polynomials and hence the x and x^2 terms won't be correlated which can help with fitting, however in some cases interpreting the coefficients is trickier with poly().
Putting it all together we have (note that I add some random error to y so as to not predict it perfectly as the example I use is more common in reality):
x <- seq(-6 ,6 ,2)
y <- x^2 + rnorm(length(x), sd = 2)
parabola <- data.frame(x = x, y = y)
mod <- lm(y ~ poly(x, 2), data = parabola)
plot(parabola)
lines(fitted(mod) ~ x, data = parabola, col = "red")
The resulting plot shows the data as points with the fitted quadratic curve drawn in red.
An additional issue is whether y is Gaussian? If y can't be negative (i.e. a count), and/or is discrete, modelling using lm() is going to be wrong. That's where glm() might come in, by which you might fit a curve without needing x^2 (although if the data really are a parabola, then x on its own isn't going to fit the response), as there is an explicit transformation of the data from the linear predictor on to the scale of the response.
It is better to think about the properties of the data and the sort of model you want to fit and then build up the degree of polynomial within that modelling framework, rather than jumping in and trying various incantations to simply curve fit the data.
The match is now perfect. A slightly more interesting parabola:
x=seq(-16,16,2)
y= 4*x*x + 10*x + 6
parabola=data.frame(x,y)
plot(parabola)
model=lm(y~poly(x,2),dat=parabola)
summary(model)
test=data.frame(x=seq(-15,15,2))
test$y=predict(model,test)
points(test,pch=3)
An amateur (like me) might expect the coefficients of the model to be (4,10,6) to match 4*x*x + 10*x + 6
Call:
lm(formula = y ~ poly(x, 2), data = parabola)
Residuals:
Min 1Q Median 3Q Max
-3.646e-13 -8.748e-14 -3.691e-14 4.929e-14 6.387e-13
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.900e+02 5.192e-14 7.511e+15 <2e-16 ***
poly(x, 2)1 4.040e+02 2.141e-13 1.887e+15 <2e-16 ***
poly(x, 2)2 1.409e+03 2.141e-13 6.581e+15 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.141e-13 on 14 degrees of freedom
Multiple R-squared: 1, Adjusted R-squared: 1
F-statistic: 2.343e+31 on 2 and 14 DF, p-value: < 2.2e-16
Why would the coefficients be (390,404,1409)?
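A short sketch of what is going on, using the same parabola: poly(x, 2) builds orthogonal polynomial columns by default, so its coefficients are expressed in that orthogonal basis rather than in powers of x. Passing raw = TRUE, or writing the terms out with I(), recovers the coefficients on the original scale. The object names below are illustrative.

```r
x <- seq(-16, 16, 2)
y <- 4 * x^2 + 10 * x + 6

fit_orth <- lm(y ~ poly(x, 2))               # orthogonal basis (the default)
fit_raw  <- lm(y ~ poly(x, 2, raw = TRUE))   # plain powers x and x^2
fit_I    <- lm(y ~ x + I(x^2))               # the same model written out by hand

coef(fit_orth)   # intercept equals mean(y); the other two are in the orthogonal basis
coef(fit_raw)    # intercept, x and x^2 coefficients on the original scale
coef(fit_I)

# All three parameterisations describe the same fitted curve
same_fit <- isTRUE(all.equal(fitted(fit_orth), fitted(fit_raw)))
```

With the orthogonal basis the intercept is simply the mean of y, which is 390 for this data, and the fitted values and predictions are identical across the three versions; only the meaning of the coefficients differs.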
