R confusionMatrix error: data and reference should be factors with the same levels

I'm trying to understand how to make a confusion matrix after I use the glm function for a logistic regression. Here is my code so far. I am using the caret package and the confusionMatrix function.
dput(head(wine_quality))
structure(list(fixed.acidity = c(7, 6.3, 8.1, 7.2, 7.2, 8.1),
volatile.acidity = c(0.27, 0.3, 0.28, 0.23, 0.23, 0.28),
citric.acid = c(0.36, 0.34, 0.4, 0.32, 0.32, 0.4), residual.sugar = c(20.7,
1.6, 6.9, 8.5, 8.5, 6.9), chlorides = c(0.045, 0.049, 0.05,
0.058, 0.058, 0.05), free.sulfur.dioxide = c(45, 14, 30,
47, 47, 30), total.sulfur.dioxide = c(170, 132, 97, 186,
186, 97), density = c(1.001, 0.994, 0.9951, 0.9956, 0.9956,
0.9951), pH = c(3, 3.3, 3.26, 3.19, 3.19, 3.26), sulphates = c(0.45,
0.49, 0.44, 0.4, 0.4, 0.44), alcohol = c(8.8, 9.5, 10.1,
9.9, 9.9, 10.1), quality = structure(c(4L, 4L, 4L, 4L, 4L,
4L), .Label = c("3", "4", "5", "6", "7", "8", "9", "white"
), class = "factor"), type = structure(c(3L, 3L, 3L, 3L,
3L, 3L), .Label = c("", "red", "white"), class = "factor"),
numeric_type = c(0, 0, 0, 0, 0, 0)), row.names = c(NA, 6L
), class = "data.frame")
library(tibble)
library(broom)
library(ggplot2)
library(caret)
any(is.na(wine_quality)) # this evaluates to FALSE
wine_model <- glm(type ~ fixed.acidity + volatile.acidity + citric.acid + residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + density + pH + sulphates + alcohol, wine_quality, family = "binomial")
# split data into test and train
smp_size <- floor(0.75 * nrow(wine_quality))
set.seed(123)
train_ind <- sample(seq_len(nrow(wine_quality)), size = smp_size)
train <- wine_quality[train_ind, ]
test <- wine_quality[-train_ind, ]
# make prediction on train data
pred <- predict(wine_model)
train$fixed.acidity <- as.numeric(train$fixed.acidity)
round(train$fixed.acidity)
train$fixed.acidity <- as.factor(train$fixed.acidity)
pred <- as.numeric(pred)
round(pred)
pred <- as.factor(pred)
confusionMatrix(pred, wine_quality$fixed.acidity)
After this final line of code, I get this error:
Error: `data` and `reference` should be factors with the same levels.
This error doesn't make sense to me. I've checked that pred and fixed.acidity have the same length (6497) and that both are factors.
length(pred)
length(wine_quality$fixed.acidity)
class(pred)
class(train$fixed.acidity)
Is there any obvious reason why this confusion matrix is not working? I'm trying to find a hit ratio for the model. I would appreciate dummy-level explanations; I really don't know what I'm doing here.

The error from confusionMatrix() tells us that the two variables passed to the function need to be factors with the same levels. We can see why we received the error when we run str() on both variables.
> str(pred)
Factor w/ 5318 levels "-23.6495182533792",..: 310 339 419 1105 310 353 1062 942 594 1272 ...
> str(wine_quality$fixed.acidity)
num [1:6497] 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
pred is a factor, while wine_quality$fixed.acidity is a numeric vector. The confusionMatrix() function is used to compare predicted and actual values of a dependent variable; it is not intended to cross-tabulate predictions against an independent variable.
The code in the question uses fixed.acidity in the confusion matrix when it should be comparing predicted values of type against actual values of type from the testing data.
Also, the code in the question creates the model prior to splitting the data into test and training data. The correct procedure is to split the data before building a model on the training data, make predictions with the testing (hold back) data, and compare actuals to predictions in the testing data.
Finally, the result of the predict() function as coded in the original post is the linear predictor from the GLM model (equivalent to wine_model$linear.predictors in the output model object). These values must be transformed to the response scale and converted to class labels before they can be used in confusionMatrix().
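For completeness, a minimal sketch of that transformation (not from the original post; it assumes type has been reduced to a two-level red/white factor and that the model has been refit on the training data only):
# predicted probabilities of the second factor level ("white")
prob <- predict(wine_model, newdata = test, type = "response")
# convert probabilities to class labels at a 0.5 cutoff
pred_class <- factor(ifelse(prob > 0.5, "white", "red"), levels = c("red", "white"))
confusionMatrix(data = pred_class, reference = test$type)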
In practice, it's easier to use caret::train() with the GLM method and binomial family, where predict() will generate results that are usable in confusionMatrix(). We'll illustrate this with the UCI wine quality data.
First, we download the data from the UCI Machine Learning Repository to make the example reproducible.
download.file("https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv",
"./data/wine_quality_red.csv")
download.file("https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv",
"./data/wine_quality_white.csv")
Second, we load the data, assign type as either red or white depending on the data file, and bind the data into a single data frame.
red <- read.csv("./data/wine_quality_red.csv",header = TRUE,sep=";")
white <- read.csv("./data/wine_quality_white.csv",header = TRUE,sep=";")
red$type <- "red"
white$type <- "white"
wine_quality <- rbind(red,white)
wine_quality$type <- factor(wine_quality$type)
Next, we split the data into test and training sets based on the values of type, so each data frame gets a proportional number of red and white wines, and train a model with the default caret::train() settings and the GLM method.
library(caret)
set.seed(123)
inTrain <- createDataPartition(wine_quality$type, p = 3/4)[[1]]
training <- wine_quality[ inTrain,]
testing <- wine_quality[-inTrain,]
aModel <- train(type ~ ., data = training, method = "glm", family = "binomial")
Finally, we use the model to make predictions on the hold back data frame, and run a confusion matrix.
testLM <- predict(aModel,testing)
confusionMatrix(data=testLM,reference=testing$type)
...and the output:
> confusionMatrix(data=testLM,reference=testing$type)
Confusion Matrix and Statistics
          Reference
Prediction  red white
     red    393     3
     white    6  1221
Accuracy : 0.9945
95% CI : (0.9895, 0.9975)
No Information Rate : 0.7542
P-Value [Acc > NIR] : <2e-16
Kappa : 0.985
Mcnemar's Test P-Value : 0.505
Sensitivity : 0.9850
Specificity : 0.9975
Pos Pred Value : 0.9924
Neg Pred Value : 0.9951
Prevalence : 0.2458
Detection Rate : 0.2421
Detection Prevalence : 0.2440
Balanced Accuracy : 0.9913
'Positive' Class : red

Related

Is there a way to obtain residual plots for all interaction terms?

I am working on an exercise asking me "Plot the residuals against Y_hat, each predictor variable, and each two-factor interaction term on separate graphs." Here is a snippet of the data set I am using:
> dput(head(Commercial_Properties, 10))
structure(list(Rental_Rates = c(13.5, 12, 10.5, 15, 14, 10.5,
14, 16.5, 17.5, 16.5), Age = c(1, 14, 16, 4, 11, 15, 2, 1, 1,
8), Op_Expense_Tax = c(5.02, 8.19, 3, 10.7, 8.97, 9.45, 8, 6.62,
6.2, 11.78), Vacancy_Rate = c(0.14, 0.27, 0, 0.05, 0.07, 0.24,
0.19, 0.6, 0, 0.03), Total_Sq_Ft = c(123000, 104079, 39998, 57112,
60000, 101385, 31300, 248172, 215000, 251015), residuals = c(`1` = -1.03567244005944,
`2` = -1.51380641405037, `3` = -0.591053402133659, `4` = -0.133568082335235,
`5` = 0.313283765150399, `6` = -3.18718522392237, `7` = -0.538356748944345,
`8` = 0.236302385996349, `9` = 1.98922037248654, `10` = 0.105829602747806
)), row.names = c(NA, -10L), class = c("tbl_df", "tbl", "data.frame"
))
From here I created the proper linear model that includes two factor interaction terms:
commercial_properties_lm_two_degree_interaction <-
lm(data=Commercial_Properties,
formula=Rental_Rates ~ (Age + Op_Expense_Tax + Vacancy_Rate + Total_Sq_Ft)^2)
Next, what I was hoping to accomplish was to plot the residuals not just against the linear terms, but also against the interaction terms. I attempted to do this using the residualPlots() function in the car package:
library(car)
residualPlots(model=commercial_properties_lm_two_degree_interaction,
terms=~ (Age + Op_Expense_Tax + Vacancy_Rate + Total_Sq_Ft)^2)
When applied this way, the output only produced the residual plots against the linear terms; it didn't plot any interactions. So I then attempted to specify the interaction terms manually, but I got an error:
residualPlots(model=commercial_properties_lm_two_degree_interaction,
terms=~ Age + Op_Expense_Tax + Vacancy_Rate + Tota_Sq_Ft +
Age:Op_Expense_Tax + Age:Vacancy_Rate)
Error in termsToMf(model, terms) : argument 'terms' not interpretable.
Now, if I do things completely manually, I am able to get an interaction plot; for example:
with(data=Commercial_Properties, plot(x=Op_Expense_Tax * Vacancy_Rate, y=residuals))
plotted successfully. My issue is that, sure, I can do this completely manually for a reasonably small number of variables, but it will get extremely tedious once the number of variables grows.
So my question is: is there an existing R function for making residual plots of the interaction terms, or am I left doing it completely manually, most likely by writing some sort of loop?
Note: I'm not asking about partial residuals; I haven't gotten to that point in the text I'm using. Just plain interaction terms plotted against residuals.
You could do an eval(parse()) approach using the 'term.labels' attribute.
Use gsub(':', '*', a[grep(':', a)]) to pull out the interaction terms and replace : with * so they can be evaluated.
a <- attr(terms(commercial_properties_lm_two_degree_interaction), 'term.labels')
op <- par(mfrow=c(2, 3))
with(Commercial_Properties,
lapply(gsub(':', '*', a[grep(':', a)]), function(x)
plot(eval(parse(text=x)), residuals, xlab=x)))
par(op)
Edit
This is how we would do this with a for loop in R:
as <- gsub(':', '*', a[grep(':', a)])
op <- par(mfrow=c(2, 3))
for (x in as) {
with(Commercial_Properties,
plot(eval(parse(text=x)), residuals, xlab=x)
)
}
par(op)
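An alternative sketch (not part of the original answer): the interaction columns can also be taken directly from the fitted model's design matrix, which avoids eval(parse()) entirely:
# pull the interaction columns out of the design matrix
mm <- model.matrix(commercial_properties_lm_two_degree_interaction)
int_cols <- grep(':', colnames(mm), value = TRUE)
op <- par(mfrow = c(2, 3))
for (x in int_cols) {
  plot(mm[, x], residuals(commercial_properties_lm_two_degree_interaction),
       xlab = x, ylab = 'residuals')
}
par(op)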

Problem with post-hoc emmeans() test after lmerTest

I have a dataset looking at a response variable (Fat %), over time (Week 0-4), and over a treatment condition -- short vs long day.
I used an lmer model to test whether the variables and the interaction term were significant, and they were. I want to look further at the interaction term (basically a Tukey test, but still accounting for the repeated measures). That's when I started to use the emmeans package, and the output is not giving me the full output I would like. Any suggestions would be appreciated, thank you.
Here is my data set:
structure(list(`Bird ID` = c(61, 62, 71, 72, 73, 76, 77, 63,
64, 69), Day = c("long", "long", "long", "long", "long", "long",
"long", "short", "short", "short"), Week = c(0, 0, 0, 0, 0, 0,
0, 0, 0, 0), `Body Weight` = c(34.57, 41.05, 37.74, 37.04, 33.38,
35.6, 31.88, 34.32, 35.5, 35.78), `Fat %` = c(2.42718446601942,
2.07515423443634, 11.7329093799682, 8.61137591356848, 5.36031238906638,
7.9879679144385, 1.2263099219621, 5.17970401691332, 8.73096446700508,
3.62993896562801), `Lean %` = c(97.5728155339806, 97.9248457655636,
88.2670906200318, 91.3886240864315, 94.6396876109336, 92.0120320855615,
98.7736900780379, 94.8202959830867, 91.2690355329949, 96.370061034372
), `Fat(g)` = c(0.7, 0.74, 3.69, 2.71, 1.51, 2.39, 0.33, 1.47,
2.58, 1.13), `Lean(g)` = c(28.14, 34.92, 27.76, 28.76, 26.66,
27.53, 26.58, 26.91, 26.97, 30), ID = c(1, 2, 3, 4, 5, 6, 7,
8, 9, 10)), row.names = c(NA, -10L), class = c("tbl_df", "tbl",
"data.frame"))
Code I have tried:
Model:
library(lmerTest)
library(emmeans)
model3b <- lmer(`Fat %` ~ Day + Week + Day:Week + (1|ID), data = jussara_data)
summary(model3b)
resp <- jussara_data$`Fat %`
f1 <- jussara_data$Week
f2 <- jussara_data$Day
fit1 = lm(log(resp) ~ f1 + f2 + f1:f2, data = jussara_data)
emm1 = emmeans(fit1, specs = pairwise ~ f1:f2)
emm1$emmeans
emm1$contrasts
I was hoping the contrasts would give me a summary looking something like this (but I need the repeated measures included, not just this ANOVA analysis):
Fat % groups
4:short 32.065752 a
3:short 27.678036 a
2:short 21.358485 b
4:long 13.895404 c
1:short 13.138941 c
2:long 12.245741 c
3:long 12.138498 c
1:long 10.315978 cd
0:short 6.134327 d
0:long 5.631602 d
but instead it only gave me this:
f1 f2 emmean SE df lower.CL upper.CL
2 long 2.24 0.0783 66 2.09 2.40
2 short 2.80 0.0783 66 2.64 2.95
Results are given on the log (not the response) scale.
Confidence level used: 0.95
contrast estimate SE df t.ratio p.value
2 long - 2 short -0.556 0.111 66 -5.025 <.0001
Results are given on the log (not the response) scale.
Thank you for the help!
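One possible direction (a sketch, not from the original thread): apply emmeans() to the mixed model itself, so the repeated-measures structure is kept, and treat Week as a factor so each week gets its own estimated mean. This assumes the full data set (the snippet above only contains Week 0) and the lmerTest, emmeans and multcomp packages:
library(lmerTest)
library(emmeans)
jussara_data$WeekF <- factor(jussara_data$Week)
model3c <- lmer(`Fat %` ~ Day * WeekF + (1 | ID), data = jussara_data)
emm <- emmeans(model3c, specs = pairwise ~ WeekF:Day)
emm$emmeans
emm$contrasts
# multcomp::cld(emm$emmeans) gives a compact letter display similar to the
# "groups" table shown above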

How to plot the decision boundary for a Gaussian Naive Bayes classifier?

I use the toy dataset (class membership variable & 2 features) below to apply a Gaussian Naive Bayes model and plot the contours of the class-specific bivariate normal distributions.
How to add a line for the decision boundary to the plot below?
Like here:
(Image source: https://alliance.seas.upenn.edu/~cis520/dynamic/2016/wiki/uploads/Lectures/2class_gauss_NB.jpg)
# Packages
library(klaR)
library(MASS)
# Data
d <- structure(list(y = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 1L), .Label = c("0", "1"), class = "factor"), x1 = c(2, 2.8, 1.5, 2.1, 5.5, 8, 6.9, 8.5, 2.5, 7.7), x2 = c(1.5, 1.2, 1, 1, 4, 4.8, 4.5, 5.5, 2, 3.5)), .Names = c("y", "x1", "x2"), row.names = c(NA, -10L), class = "data.frame")
# Naive Bayes Model
mN <- NaiveBayes(y ~ x1+x2, data = d)
# Data
# Class 1
m1 <- mean(d[which(d$y==1),]$x1)
m2 <- mean(d[which(d$y==1),]$x2)
mu1_2 <- c(m1,m2) # Mean
sd1 <- sd(d[which(d$y==1),]$x1)
sd2 <- sd(d[which(d$y==1),]$x2)
Sigma1_2 <- matrix(c(sd1, 0, 0, sd2), 2) # Covariance matrix
bivn1_2 <- mvrnorm(5000, mu = mu1_2, Sigma = Sigma1_2 ) # from Mass package: Simulate bivariate normal PDF
bivn1_2.kde <- kde2d(bivn1_2[,1], bivn1_2[,2], n = 50) # from MASS package: Calculate kernel density estimate
# Class 0
m3 <- mean(d[which(d$y==0),]$x1)
m4 <- mean(d[which(d$y==0),]$x2)
mu3_4 <- c(m3,m4) # Mean
sd3 <- sd(d[which(d$y==0),]$x1)
sd4 <- sd(d[which(d$y==0),]$x2)
Sigma3_4 <- matrix(c(sd3, 0, 0, sd4), 2) # Covariance matrix
bivn3_4 <- mvrnorm(5000, mu = mu3_4, Sigma = Sigma3_4 ) # from Mass package: Simulate bivariate normal PDF
bivn3_4.kde <- kde2d(bivn3_4[,1], bivn3_4[,2], n = 50) # from MASS package: Calculate kernel density estimate
# Plot
plot(x= d$x1, y=d$x2, xlim=c(-1,10), ylim=c(-1,10), col=d$y, pch=19, cex=2, ylab="x2", xlab="x1")
contour(bivn1_2.kde, add = TRUE, col="darkgrey") # from base graphics package
contour(bivn3_4.kde, add = TRUE, col="darkgrey") # from base graphics package
text(labels = "Class 1",x = 8, y=7, col="grey")
text(labels = "Class 0",x = 0, y=4, col="grey")

Plotting both a GLM and LM of same data

I would like to plot both a linear model (LM) and a non-linear (GLM) model of the same data.
The range between 16% and 84% should line up between the LM and the GLM (citation: section 3.5).
I have included a more complete chunk of the code because I am not sure at which point I should try to cut the linear model, or at which point I have messed up (I think with the linear model).
The code below results in the following image:
My Objective (taken from previous citation-link).
Here is my data:
mydata3 <- structure(list(
dose = c(0, 0, 0, 3, 3, 3, 7.5, 7.5, 7.5, 10, 10, 10, 25, 25, 25, 50, 50, 50),
total = c(25L, 25L, 25L, 25L, 25L, 25L, 25L, 25L, 25L, 25L, 25L, 25L, 25L, 25L, 25L, 25L, 25L, 25L),
affected = c(1, 0, 1.2, 2.8, 4.8, 9, 2.8, 12.8, 8.6, 4.8, 4.4, 10.2, 6, 20, 14, 12.8, 23.4, 21.6),
probability = c(0.04, 0, 0.048, 0.112, 0.192, 0.36, 0.112, 0.512, 0.344, 0.192, 0.176, 0.408, 0.24, 0.8, 0.56, 0.512, 0.936, 0.864)),
.Names = c("dose", "total", "affected", "probability"),
row.names = c(NA, -18L),
class = "data.frame")
My script:
#load libraries
library(ggplot2)
library(drc) # glm model
library(plyr) # rename function
library(scales) #log plot scale
#Creating linear model
mod_linear <- lm(probability ~ (dose), weights = total, data = mydata3)
#Creating data.frame: note values 3 and 120 refer to 16% and 84% response in sigmoidal plot
line_df <-expand.grid(dose=exp(seq(log(3),log(120),length=200)))
#Extracting values from linear model
p_line_df <- as.data.frame(cbind(dose = line_df,
predict(mod_linear, newdata=data.frame(dose = line_df),
interval="confidence",level=0.95)))
#Renaming linear df columns
p_line_df <-rename(p_line_df, c("fit"="probability"))
p_line_df <-rename(p_line_df, c("lwr"="Lower"))
p_line_df <-rename(p_line_df, c("upr"="Upper"))
p_line_df$model <-"Linear"
#Create sigmoidal dose-response curve using drc package
mod3 <- drm(probability ~ (dose), weights = total, data = mydata3, type ="binomial", fct=LL.2(names=c("Slope:b","ED50:e")))
#data frame for ggplot2
base_DF_3 <-expand.grid(dose=exp(seq(log(1.0000001),log(10000),length=200)))
#extract data from model
p_df3 <- as.data.frame(cbind(dose = base_DF_3,
predict(mod3, newdata=data.frame(dose = base_DF_3),
interval="confidence", level=.95)))
#renaming columns
p_df3 <-rename(p_df3, c("Prediction"="probability"))
p_df3$model <-"Sigmoidal"
#combining Both DataFames
p_df_all <- rbind(p_df3, p_line_df)
#plotting
ggplot(p_df_all, aes(x=dose,y=probability, group=model))+
geom_line(aes(x=dose,y=probability,group=model,linetype=model),show.legend = TRUE)+
scale_x_log10(breaks = c(0.000001, 10^(0:10)),labels = c(0, math_format()(0:10)))
Looking at the reference you provided, what the authors describe is the use of a linear model to approximate the central portion of a (sigmoidal) logistic function. The linear model that achieves this is a straight line that passes through the inflection point of the logistic curve, and has the same slope as the logistic function at that inflection point. We can use some basic algebra and calculus to solve this problem.
From ?LL.2, we see that the form of the logistic function being fitted by drm is
f(x) = 1 / {1 + exp(b(log(x) - log(e)))}
We can get the values of the coefficient in this equation by
b = mod3$coefficients[1]
e = mod3$coefficients[2]
Now, by differentiation, the slope of the logistic function with respect to log(x) (the dose axis of the plot is log-scaled) is given by
dy/d(log x) = -(b * exp((log(x)-log(e))*b)) / (1+exp((log(x)-log(e))*b))^2
At the inflection point, the dose (x) is equal to the coefficient e, thus the slope at the inflection point simplifies (greatly) to
sl50 = -b/4
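As an optional numerical check (not part of the original derivation), the finite-difference slope of the fitted curve in log-dose coordinates at dose = e should come out close to -b/4:
# numerical sanity check of sl50 (assumes mod3, b and e from above)
h <- 1e-4
f_logdose <- function(u) predict(mod3, newdata = data.frame(dose = exp(u)))
(f_logdose(log(e) + h) - f_logdose(log(e) - h)) / (2 * h)  # approximately -b/4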
Since we also know that the inflection point occurs at the point where probability = 0.5 and dose = e, we can construct the straight line (in log-transformed coordinates) like this:
linear_probability = sl50 * (log(p_df3$dose) - log(e)) + 0.5
Now, to plot the logistic and linear functions together:
p_df3_lin = p_df3
p_df3_lin$model = 'linear'
p_df3_lin$probability = linear_probability
p_df_all <- rbind(p_df3, p_df3_lin)
ggplot(p_df_all, aes(x=dose,y=probability, group=model))+
geom_line(aes(x=dose,y=probability,group=model,linetype=model),show.legend = TRUE)+
scale_x_log10(breaks = c(0.000001, 10^(0:10)),labels = c(0, math_format()(0:10))) +
scale_y_continuous(limits = c(0,1))

Error messages when running glmer in R

I am attempting to run two similar generalized linear mixed models in R. Both models have the same input variables for predictors, covariates and random factors, however, response variables differ. Models require the lme4 package. The issue I was having with the second model has been resolved by Ben Bolker.
In the first model, the response variable is biomass weight and family = gaussian.
global.model <- lmer(ex.drywght ~ forestloss562*forestloss17*roaddenssec*nearestroadprim +
elevation + soilPC1 + soilPC2 +
(1|block/fragment),
data = RespPredComb,
family = "gaussian")
Predictors have the following units:
forestloss562 = %,
forestloss17 = %,
roaddenssec = (km/km2) and
nearestroadprim = (m).
Executing this model brings up the following warning messages:
Warning messages:
1: In glmer(ex.drywght ~ forestloss562 * forestloss17 * roaddenssec * :
calling glmer() with family=gaussian (identity link) as a shortcut to lmer() is deprecated; please call lmer() directly
2: Some predictor variables are on very different scales: consider rescaling
I then perform these subsequent steps (following the sequence of steps described in Grueber et al. (2011)):
I standardize predictors,
stdz.model <- standardize(global.model, standardize.y = FALSE)
(requires package arm)
use automated model selection with subsets of the supplied ‘global’ model
model.set <- dredge(stdz.model)
(requires package MuMIn)
Here I get the following warning message:
Warning message:
In dredge(stdz.model2) : comparing models fitted by REML
find the top 2 AIC models and
top.models <- get.models(model.set, subset = delta < 2)
do model averaging
model.avg(model.set, subset = delta < 2)
Here, I get this error message:
Error in apply(apply(z, 2L, is.na), 2, all) :
dim(X) must have a positive length
Any advice on how to possibly fix this error would be very much appreciated.
In the second model, the response variable is richness, family is poisson.
global.model <- glmer(ex.richness ~ forestloss562*forestloss17*roaddenssec*nearestroadprim +
elevation + soilPC1 + soilPC2 +
(1|block/fragment),
data = mydata,
family = "poisson")
When I execute the above command I get the following error and warning messages:
Error: (maxstephalfit) PIRLS step-halvings failed to reduce deviance in pwrssUpdate
In addition: Warning messages:
1: Some predictor variables are on very different scales: consider rescaling
2: In pwrssUpdate(pp, resp, tolPwrss, GQmat, compDev, fac, verbose) :
Cholmod warning 'not positive definite' at file:../Cholesky/t_cholmod_rowfac.c, line 431
3: In pwrssUpdate(pp, resp, tolPwrss, GQmat, compDev, fac, verbose) :
Cholmod warning 'not positive definite' at file:../Cholesky/t_cholmod_rowfac.c, line 431
Please find a reproducible subset of my data below:
structure(list(plot.code = structure(c(1L, 3L, 2L, 4L, 5L, 6L,
7L), .Label = c("a100m56r", "b1m177r", "c100m56r", "d1f1r", "e1m177r",
"f1m17r", "lf10m56r"), class = "factor"), site.code = structure(c(1L,
3L, 2L, 4L, 5L, 6L, 7L), .Label = c("a100m56", "b1m177", "c100m56",
"d1f1", "e1m177", "f1m17", "lf10m56"), class = "factor"), block = structure(c(1L,
3L, 2L, 4L, 5L, 6L, 7L), .Label = c("a", "b", "c", "d", "e",
"f", "lf"), class = "factor"), fragment = structure(c(1L, 3L,
2L, 4L, 5L, 6L, 7L), .Label = c("a100", "b1", "c100", "d1", "e1",
"f1", "lf10"), class = "factor"), elevation = c(309L, 342L, 435L,
495L, 443L, 465L, 421L), forestloss562 = c(25.9, 56.77, 5.32,
27.4, 24.25, 3.09, 8.06), forestloss17 = c(7.47, 51.93, 79.76,
70.41, 80.55, 0, 0), roaddenssec = c(2.99, 3.92, 2.61, 1.58,
1.49, 1.12, 1.16), nearestroadprim = c(438L, 237L, 2637L, 327L,
655L, 528L, 2473L), soilPC1 = c(0.31, -0.08, 1.67, 2.39, -1.33,
-1.84, -0.25), soilPC2 = c(0.4, 0.41, -0.16, 0.15, 0.03, -0.73,
0.51), ex.richness = c(0L, 0L, 1L, 7L, 0L, 0L, 1L), ex.drywght = c(0,
0, 1.255, 200.2825, 0, 0, 0.04)), .Names = c("plot.code", "site.code",
"block", "fragment", "elevation", "forestloss562", "forestloss17",
"roaddenssec", "nearestroadprim", "soilPC1", "soilPC2", "ex.richness",
"ex.drywght"), class = "data.frame", row.names = c(NA, -7L))
tl;dr you need to standardize your variables before you fit the model, for greater numerical stability. I also have a few comments about the advisability of what you're doing, but I'll save them for the end ...
source("SO_glmer_26904580_data.R")
library("arm")
library("lme4")
library("MuMIn")
Try the first fit:
pmod <- glmer(ex.richness ~
forestloss562*forestloss17*roaddenssec*nearestroadprim +
elevation + soilPC1 + soilPC2 +
(1|block/fragment),
data = dat,
family = "poisson")
This fails, as reported above.
However, I get a warning you didn't report above:
## 1: Some predictor variables are on very different scales: consider rescaling
which provides a clue.
Scaling the numeric predictors:
pvars <- c("forestloss562","forestloss17",
"roaddenssec","nearestroadprim",
"elevation","soilPC1","soilPC2")
datsc <- dat
datsc[pvars] <- lapply(datsc[pvars],scale)
Try again:
pmod <- glmer(ex.richness ~
forestloss562*forestloss17*roaddenssec*nearestroadprim +
elevation + soilPC1 + soilPC2 +
(1|block/fragment),
data = datsc,
family = "poisson",
na.action="na.fail")
This works, although we get a warning message about a too-large gradient -- I think this is actually ignorable (we're still working on getting these error sensitivity thresholds right).
As far as I can tell, the following lines seem to be working:
stdz.model <- standardize(pmod, standardize.y = FALSE)
## increases max gradient -- larger warning
model.set <- dredge(stdz.model) ## slow, but running ...
Here are my comments about advisability:
Not even counting random-effects parameters, you have only 8x as many observations as predictor variables. This is pushing it (a rule of thumb is that you should have 10-20 observations per parameter).
nrow(datsc) ## 159
ncol(getME(pmod,"X")) ## 19
Dredging/multi-model-averaging over models with and without interactions can be dangerous -- at the very least, centering continuous variables is necessary in order for it to be interpretable. (I don't know whether dredge does anything to try to be sensible in this case.)
I also tried glmmLasso on this problem -- it ended up shrinking away all of the fixed effect terms ...
library("glmmLasso")
datsc$bf <- interaction(datsc$block,datsc$fragment)
glmmLasso(ex.richness ~
forestloss562+forestloss17+roaddenssec+nearestroadprim +
elevation + soilPC1 + soilPC2,
rnd=list(block=~1,bf=~1),
data = datsc,
family = poisson(),
lambda=500)
