R: Bootstrap Multiple Regression

I am trying to run a multiple regression analysis to find the impact of water quality on plankton abundance at a specific location (called Guzzler). I was able to get the model and its summary to run; however, the data are non-parametric, so a typical summary would not be reliable. This is mainly due to the small sample size, since sampling was done over the course of a few weeks with one sample per week.
I was then thinking the non-parametric version of this could be a bootstrap. I've run bootstraps on other data before, but never on a multiple regression model. I can't seem to find code on how to go about this, so I began with how I've performed bootstraps in the past. I was curious what I would need to edit in order to get this bootstrap to run.
Here is the output from dput(head(Guzzler1)):
structure(list(Abundance = c(98L, 43L, 65L, 55L, 54L), Phospates = c(2L,
2L, 2L, 2L, 2L), Nitrates = c(0, 0.3, 0, 0.15, 0), pH = c(7.5,
8, 7.5, 7, 7)), .Names = c("Abundance", "Phospates", "Nitrates",
"pH"), row.names = c(NA, 5L), class = "data.frame")
Here is my model & the summary:
Guzzler1model<-lm(Abundance ~ Phospates + Nitrates + pH, data=Guzzler1)
> summary(Guzzler1model)
Call:
lm(formula = Abundance ~ Phospates + Nitrates + pH, data = Guzzler1)
Residuals:
1 2 3 4 5
20.75 -4.25 -12.25 8.50 -12.75
Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) -80.25 209.62 -0.383 0.739
Phospates NA NA NA NA
Nitrates -135.00 90.02 -1.500 0.272
pH 21.00 28.87 0.727 0.543
Residual standard error: 20.41 on 2 degrees of freedom
Multiple R-squared: 0.5302, Adjusted R-squared: 0.06032
F-statistic: 1.128 on 2 and 2 DF, p-value: 0.4698
Please note: I believe Phospates has NA because every value was equal to 2 in this particular location.
Here is how I was originally performing a bootstrap; I am unsure what to change:
n = length(Guzzler1$Abundance)
B = 1000
results = numeric(B)
for (b in 1:B) {
  i = sample(x = 1:n, size = n, replace = TRUE)
  bootSample = Guzzler1$Abundance[i]
  thetahat = mean(bootSample)
  results[b] = thetahat
}
Thank you so much in advance!

I am not quite sure what you mean by non-parametric data, but I understand that you want to take bootstrapped samples from your data and perform linear regression on them.
A possible way to do that would be:
Guzzler1 <- structure(list(Abundance = c(98L, 43L, 65L, 55L, 54L), Phospates = c(2L,
2L, 2L, 2L, 2L), Nitrates = c(0, 0.3, 0, 0.15, 0), pH = c(7.5,
8, 7.5, 7, 7)), .Names = c("Abundance", "Phospates", "Nitrates",
"pH"), row.names = c(NA, 5L), class = "data.frame")
lines <- nrow(Guzzler1)
replicate(5, lm(Abundance ~ Phospates + Nitrates + pH,
data=Guzzler1[sample(lines, replace = TRUE),])$coefficients)
This will report the coefficients from 5 linear regressions, like this:
> replicate(5, lm(Abundance ~ Phospates + Nitrates + pH,
+ data=Guzzler1[sample(lines, replace = TRUE),])$coefficients)
[,1] [,2] [,3] [,4] [,5]
(Intercept) 65.00000 145.000000 -408.0000 145.000000 -100
Phospates NA NA NA NA NA
Nitrates -73.33333 6.666667 -256.6667 6.666667 -110
pH NA -13.000000 66.0000 -13.000000 22
The number of replicates (5 here) can be made arbitrarily higher by changing the first argument of my replicate call. The many NA values are due to the scarce data, as @IRTFM predicted and explained. This will improve as more data are sampled.
Let's sample 5000 bootstrap samples and investigate the distribution of the Nitrate coefficients:
reps <- replicate(5000, lm(Abundance ~ Phospates + Nitrates + pH,
data=Guzzler1[sample(lines, replace = TRUE),])$coefficients)
plot(table(reps["Nitrates",]))
plot(ecdf(reps["Nitrates",]))
quantile(reps["Nitrates",], c(.025, .25, .5, .75, .975), na.rm = TRUE)
Phospates can be added to this once there is some variation in the data:
boxplot(reps["(Intercept)",], reps["Nitrates",], reps["pH",]
, names = c("Intercept", "Nitrates", "pH"), ylab="bootstrapped coefficients")
abline(h=0, col="firebrick", lty=3)
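If you would rather use a packaged bootstrap, here is a rough sketch with the boot package (my own addition, not something used above). boot() expects a statistic function that takes the data and a vector of resampled row indices:
library(boot)
# statistic: refit the regression on a resampled data set and return its coefficients
coef_fun <- function(data, idx) {
  coef(lm(Abundance ~ Phospates + Nitrates + pH, data = data[idx, ]))
}
set.seed(1)
boot_out <- boot(Guzzler1, coef_fun, R = 5000)
# percentile interval for the Nitrates coefficient (column 3 of the stored results);
# NAs appear whenever a resample is too degenerate to estimate that coefficient
quantile(boot_out$t[, 3], c(.025, .5, .975), na.rm = TRUE)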

Related

R data visualization: Is there a way to plot based on emmeans using ggplot?

I am trying to visualize my data separately as a bar graph and as a dot plot connected by a line.
The experimental design includes 2 treatments, 3 levels for each treatment, and 2 diets as independent variables, and weight measurement as the dependent variable. Each sample (e.g. treatment "a", level "1", diet "l") is duplicated. Below is a sample data frame (the response variable values are simplified):
df <- data.frame(treatment=c('a','a','a','b','b','b','a','a','a','b','b','b',
'a','a','a','b','b','b','a','a','a','b','b','b',
'a','a','a','b','b','b','a','a','a','b','b','b',
'a','a','a','b','b','b','a','a','a','b','b','b'),
level=c(1,2,3,1,2,3,1,2,3,1,2,3,
1,2,3,1,2,3,1,2,3,1,2,3,
1,2,3,1,2,3,1,2,3,1,2,3,
1,2,3,1,2,3,1,2,3,1,2,3),
diet=c('l','l','l','l','l','l','h','h','h','h','h','h',
'l','l','l','l','l','l','h','h','h','h','h','h',
'l','l','l','l','l','l','h','h','h','h','h','h',
'l','l','l','l','l','l','h','h','h','h','h','h'),
rep=c(1,1,1,1,1,1,1,1,1,1,1,1,
2,2,2,2,2,2,2,2,2,2,2,2,
1,1,1,1,1,1,1,1,1,1,1,1,
2,2,2,2,2,2,2,2,2,2,2,2),
weight=c(100,75,50,50,25,12.5,100,75,50,50,25,12.5,
100,75,50,50,25,12.5,100,75,50,50,25,12.5,
200,150,100,100,50,25,200,150,100,100,50,25,
200,150,100,100,50,25,200,150,100,100,50,25))
Using a linear mixed model, I see that treatment and level effects are individually significant.
library(lme4)
fit_df <- lmer(weight ~ treatment*level*diet + (1|rep), data=df)
I have also run emmeans to see pairwise contrasts between each combination of treatment and level.
library(emmeans)
(emm_wt <- emmeans(fit_df, specs=pairwise~treatment*level))
Then, I want to visualize the result shown below in a bar graph and a dot plot connected by a line. For the bar graph, the y-axis is emmean, x-axis is treatment*level, and error bars show emmean±SE.
$emmeans
treatment level emmean SE df lower.CL upper.CL
a 1 150.0 7.98 27.7 133.64 166.4
b 1 75.0 7.98 27.7 58.64 91.4
a 2 112.5 7.98 27.7 96.14 128.9
b 2 37.5 7.98 27.7 21.14 53.9
a 3 75.0 7.98 27.7 58.64 91.4
b 3 18.8 7.98 27.7 2.39 35.1
Results are averaged over the levels of: diet
Degrees-of-freedom method: kenward-roger
Confidence level used: 0.95
The code below produces something similar to what I am looking for, but I am not sure how to add a line connecting the dots by the treatment (a1 to a3 and b1 to b3)...
It would also be nice to assign colors by the treatment (e.g. red for a and blue for b).
plot(emm_wt[[1]],
CIs=TRUE,
PIs=TRUE,
comparisons=TRUE,
colors=c("black","dark grey","grey","red"),
alpha=0.05,
adjust="tukey") +
theme_bw() +
coord_flip()
If anybody has any insights as to how I could visualize this, please let me know. Thank you in advance!
You could do something like this, using ggplot2
library(ggplot2)
ggplot(df,aes(reorder(trt,level),emmean, group=treatment, color=treatment)) +
geom_line(size=2) +
scale_color_manual(values=c("a" = "red", "b"="blue")) +
geom_linerange(aes(ymin=lower.CL, ymax=upper.CL), size=2,show.legend = F) +
geom_point(color="black", size=8) +
ylim(0,200) + labs(x="Treatment/Level", color="Treatment") +
theme(legend.position="bottom")
Output:
Input:
df = structure(list(treatment = c("a", "b", "a", "b", "a", "b"), level = c(1L,
1L, 2L, 2L, 3L, 3L), emmean = c(150, 75, 112.5, 37.5, 75, 18.8
), SE = c(7.98, 7.98, 7.98, 7.98, 7.98, 7.98), df = c(27.7, 27.7,
27.7, 27.7, 27.7, 27.7), lower.CL = c(133.64, 58.64, 96.14, 21.14,
58.64, 2.39), upper.CL = c(166.4, 91.4, 128.9, 53.9, 91.4, 35.1
), trt = structure(c(1L, 4L, 2L, 5L, 3L, 6L), .Label = c("a1",
"a2", "a3", "b1", "b2", "b3"), class = c("ordered", "factor"))), row.names = c(NA,
-6L), class = "data.frame")
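As a side note (my own addition): rather than retyping the summary by hand, the plotting data frame can be built directly from the emmeans object, since its components convert with as.data.frame():
library(emmeans)
df <- as.data.frame(emm_wt$emmeans)                      # treatment, level, emmean, SE, df, lower.CL, upper.CL
df$trt <- interaction(df$treatment, df$level, sep = "")  # "a1", "b1", ... used on the x axis above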

How to tell R the number of observations for a category

I am making a graph from a table of data from a paper. It has a column of categories of relationships, then two columns of numerical variables: the number of observations for each category and the IQ correlation:
relation num corr
spouse 3817 0.33
MZ-twin-tog 4671 0.86
MZ-twin-ap 65 0.72
DZ-twin-tog 5546 0.6
sib-tog 26473 0.47
sib-ap 203 0.24
off-par 8433 0.42
off-midpar 992 0.5
off-par-ap 814 0.22
I want to make a boxplot of (corr ~ relation), but I want the widths to be proportional to the number of observations for each category. Unfortunately varwidth = TRUE won't work, because I effectively have just one observation per category since I'm not working with the full data set.
Does anyone know how to work with this, given that I don't have the complete data, just the results?
P.S. I know boxplot is not exactly an appropriate graph for this limited data set, but I don't know how else to display (numerical ~ categorical). Suggestions are welcome!
Thank you in advance for any advice!
Data:
df1 <- structure(list(relation = structure(c(9L, 3L, 2L, 1L, 8L, 7L,
5L, 4L, 6L), .Label = c("DZ-twin-tog", "MZ-twin-ap", "MZ-twin-tog",
"off-midpar", "off-par", "off-par-ap", "sib-ap", "sib-tog", "spouse"
), class = "factor"), num = c(3817L, 4671L, 65L, 5546L, 26473L,
203L, 8433L, 992L, 814L), corr = c(0.33, 0.86, 0.72, 0.6, 0.47,
0.24, 0.42, 0.5, 0.22), num_pct = c(0.0748225977182734, 0.0915631003254009,
0.00127416003450033, 0.108715254635982, 0.518935978358882, 0.00397929980005489,
0.165307562629866, 0.019445642372682, 0.015956404124358)), .Names = c("relation",
"num", "corr", "num_pct"), row.names = c(NA, -9L), class = "data.frame")
Consider a bar plot like this (I mapped corr to color on both plots):
require(ggplot2)
g1 <- ggplot(df1, aes(relation, num))+
geom_bar(aes(fill=corr),stat="identity")+
theme_bw()
Or plotting each category as a percentage of the total:
First calculate the percents:
df1$num_pct <- df1$num/sum(df1$num)
Then plot:
g2 <- ggplot(df1, aes(relation, num_pct))+
geom_bar(aes(fill=corr),stat="identity")+
scale_y_continuous(labels=scales::percent)+
theme_bw()
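If you do want the bar widths themselves to encode the number of observations (my own suggestion, not part of the answer above), one hedged option is to draw the bars with geom_rect() and derive each bar's half-width from num_pct:
library(ggplot2)
df1$x    <- seq_len(nrow(df1))                      # bar centres at 1, 2, ..., 9
df1$half <- 0.45 * df1$num_pct / max(df1$num_pct)   # widest bar just under one unit wide
g3 <- ggplot(df1) +
  geom_rect(aes(xmin = x - half, xmax = x + half, ymin = 0, ymax = corr, fill = corr)) +
  scale_x_continuous(breaks = df1$x, labels = df1$relation) +
  labs(x = "relation", y = "corr") +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))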

calculate gaussian curve fitting on a list

I have list data like the one below. I want to perform a nonlinear-regression Gaussian curve fit of counts against mids for each element of my list and report the mean and standard deviation.
mylist<- structure(list(A = structure(list(breaks = c(-10, -9,
-8, -7, -6, -5, -4, -3, -2, -1, 0, 1, 2, 3, 4), counts = c(1L,
0L, 1L, 5L, 9L, 38L, 56L, 105L, 529L, 2858L, 17L, 2L, 0L, 2L),
density = c(0.000276014352746343, 0, 0.000276014352746343,
0.00138007176373171, 0.00248412917471709, 0.010488545404361,
0.0154568037537952, 0.028981507038366, 0.146011592602815,
0.788849020149048, 0.00469224399668783, 0.000552028705492686,
0, 0.000552028705492686), mids = c(-9.5, -8.5, -7.5, -6.5,
-5.5, -4.5, -3.5, -2.5, -1.5, -0.5, 0.5, 1.5, 2.5, 3.5),
xname = "x", equidist = TRUE), .Names = c("breaks", "counts",
"density", "mids", "xname", "equidist"), class = "histogram"),
B = structure(list(breaks = c(-7, -6, -5,
-4, -3, -2, -1, 0), counts = c(2L, 0L, 6L, 2L, 2L, 1L, 3L
), density = c(0.125, 0, 0.375, 0.125, 0.125, 0.0625, 0.1875
), mids = c(-6.5, -5.5, -4.5, -3.5, -2.5, -1.5, -0.5), xname = "x",
equidist = TRUE), .Names = c("breaks", "counts", "density",
"mids", "xname", "equidist"), class = "histogram"), C = structure(list(
breaks = c(-7, -6, -5, -4, -3, -2, -1, 0, 1), counts = c(2L,
2L, 4L, 5L, 14L, 22L, 110L, 3L), density = c(0.0123456790123457,
0.0123456790123457, 0.0246913580246914, 0.0308641975308642,
0.0864197530864197, 0.135802469135802, 0.679012345679012,
0.0185185185185185), mids = c(-6.5, -5.5, -4.5, -3.5,
-2.5, -1.5, -0.5, 0.5), xname = "x", equidist = TRUE), .Names = c("breaks",
"counts", "density", "mids", "xname", "equidist"), class = "histogram")), .Names = c("A",
"B", "C"))
I have read this:
Fitting a density curve to a histogram in R
but that is about how to fit a curve to a histogram. What I want are the best-fit values:
"Best-fit values"
" Mean"
" SD"
If I use PRISM to do it, I should get the following results
for A
Mids Counts
-9.5 1
-8.5 0
-7.5 1
-6.5 5
-5.5 9
-4.5 38
-3.5 56
-2.5 105
-1.5 529
-0.5 2858
0.5 17
1.5 2
2.5 0
3.5 2
Performing nonlinear regression Gaussian curve fitting, I get:
"Best-fit values"
" Amplitude" 3537
" Mean" -0.751
" SD" 0.3842
for the second set
B
Mids Counts
-6.5 2
-5.5 0
-4.5 6
-3.5 2
-2.5 2
-1.5 1
-0.5 3
"Best-fit values"
" Amplitude" 7.672
" Mean" -4.2
" SD" 0.4275
and for the third one
Mids Counts
-6.5 2
-5.5 2
-4.5 4
-3.5 5
-2.5 14
-1.5 22
-0.5 110
0.5 3
I get this
"Best-fit values"
" Amplitude" 120.7
" Mean" -0.6893
" SD" 0.4397
In order to convert the histogram back to an estimate of the mean and standard deviation, first expand the bin counts into repeats of the corresponding bin values. This gives an approximation of the original data.
Based on your example above:
#extract the mid points and create list of simulated data
simdata<-lapply(mylist, function(x){rep(x$mids, x$counts)})
#if the original data were integers then this may give a better estimate
#simdata<-lapply(mylist, function(x){rep(x$breaks[-1], x$counts)})
#find the mean and sd of simulated data
means<-lapply(simdata, mean)
sds<-lapply(simdata, sd)
#or use sapply in the above 2 lines depending on future process needs
If your data were integers, then using the breaks as the bins will provide a better estimate. Depending on the arguments to the histogram function (i.e. right=TRUE/FALSE), the results may be shifted by one bin.
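As a hedged side note (my addition): the same moment estimates can be obtained without building the simulated vectors at all, working directly from the mid points and counts (the n - 1 denominator matches sd() on the expanded data):
wmeans <- sapply(mylist, function(x) weighted.mean(x$mids, x$counts))
wsds   <- sapply(mylist, function(x) {
  m <- weighted.mean(x$mids, x$counts)
  sqrt(sum(x$counts * (x$mids - m)^2) / (sum(x$counts) - 1))
})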
Edit
I thought this was going to be an easy one. I reviewed the video; the sample data shown were:
mids<-seq(-7, 7)
counts<-c(7, 1, 2, 2, 2, 5, 217, 70, 18, 0, 2, 1, 2, 0, 1)
simdata<-rep(mids, counts)
The video results were mean = -0.7359 and sd = 0.4571. The solution I found that provided the closest results used the "fitdistrplus" package:
fitdist(simdata, "norm", "mge")
Using the "maximizing goodness-of-fit estimation" resulted in mean = -0.7597280 and sd= 0.8320465.
At this point, the method above provides a close estimate but does not exactly match. I do not know what technique was used to calculate the fit in the video.
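A small hedged extension: the same fitdistrplus call can be mapped over every element of mylist (rebuilding the per-element simulated data under a new name, since simdata was just reassigned to the video's single vector):
library(fitdistrplus)
simlist <- lapply(mylist, function(x) rep(x$mids, x$counts))
lapply(simlist, function(x) coef(fitdist(x, "norm", "mge")))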
Edit #2
The above solutions involved recreating the original data and fitting it using either the mean/sd or the fitdistrplus package. This attempt performs a least-squares fit using the Gaussian distribution.
simdata<-lapply(mylist, function(x){rep(x$mids, x$counts)})
means<-sapply(simdata, mean)
sds<-sapply(simdata, sd)
#Data from video
#mids<-seq(-7, 7)
#counts<-c(7, 1, 2, 2, 2, 5, 217, 70, 18, 0, 2, 1, 2, 0, 1)
#make list of the bins and distribution in each bin
mids<-lapply(mylist, function(x){x$mids})
dis<-lapply(mylist, function(x) {x$counts/sum(x$counts)})
#function to perform the least square fit
nnorm<-function(values, mids, dis) {
means<-values[1]
sds<-values[2]
#print(paste(means, sds))
#calculate out the Gaussian distribution for each bin
modeld<-dnorm(mids, means, sds)
#sum of the squares
diff<-sum( (modeld-dis)^2)
diff
}
#use optim function with the mean and sd as initial guesses
#find the minimum with the mean and SD as fit parameters
lapply(1:3, function(i) {optim(c(means[[i]], sds[[i]]), nnorm, mids=mids[[i]], dis=dis[[i]])})
This solution provides a closer answer to the PRISM results, but still not the same. Here is a comparison of all four solutions.
From the table, the least-squares fit (the one just above) provides the closest approximation. Maybe tweaking the mid points passed to dnorm would help. Case B's data are the farthest from being normally distributed, yet the PRISM software still reports a small standard deviation, while the other methods are similar to each other. It is possible the PRISM software performs some type of data filtering to remove outliers before the fit.
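For what it's worth (my own addition, not part of the solutions above): PRISM's "Gaussian" model is, as far as I know, the three-parameter amplitude form counts = Amplitude * exp(-(mids - Mean)^2 / (2 * SD^2)), so fitting that form directly with nls() may land closer to its numbers. A rough sketch, wrapped in try() because convergence is not guaranteed on such small histograms:
gauss_fits <- lapply(names(mylist), function(nm) {
  h <- mylist[[nm]]
  try(nls(counts ~ A * exp(-(mids - mu)^2 / (2 * s^2)),
          data = data.frame(mids = h$mids, counts = h$counts),
          start = list(A = max(h$counts), mu = means[[nm]], s = sds[[nm]])),
      silent = TRUE)
})
lapply(gauss_fits, function(f) if (inherits(f, "try-error")) NA else coef(f))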

Implementing Tabu Search in R

I am trying to implement Tabu Search on a classification dataset, namely the Indian Liver Patient Dataset available in the UCI repository at https://archive.ics.uci.edu/ml/datasets/ILPD+(Indian+Liver+Patient+Dataset), but I am facing issues.
Following is the code I've used:
NF <- 10
NTR <- 193
NTE <- 193
library(class)
library(e1071)
library(caret)
library(party)
library(nnet)
ILPD <- read.csv("C:/Users/Dell/Desktop/Codes and Datasets/ILPD.csv")
nrow(ILPD)
set.seed(9850)
gp<-runif(nrow(ILPD))
ILPD<-ILPD[order(gp),]
idx <- createDataPartition(y = ILPD$Class, p = 0.7, list = FALSE)
train<-ILPD[idx,]
test<-ILPD[-idx,]
ver<-test[,11]
evaluate <- function(th){
if (sum(th) == 0)return(0)
model <- svm(train[ ,th==1], train[,11] , gamma = 0.1, kernel ="sigmoid", na.action = na.omit)
pred <- predict(model, test[ ,th==1])
csRate <- sum(pred == ver)/NTE
penalty <- (NF - sum(th))/NF
return(csRate + penalty)
}
library(tabuSearch)
res <- tabuSearch(size = NF, iters = 2, objFunc = evaluate, config = matrix(1,1,NF), listSize = 5, nRestarts = 4)
plot(res)
plot(res, "tracePlot")
summary(res, verbose = TRUE)
Error:
Error in if (any(co)) { : missing value where TRUE/FALSE needed
In addition: Warning message:
In FUN(newX[, i], ...) : NAs introduced by coercion
Called from: svm.default(train[, th == 1], train[, 11], gamma = 0.1, kernel = "sigmoid", na.action = na.omit)
Some part of the data
structure(list(age = c(55L, 48L, 14L, 17L, 40L, 37L), gender = c(0L,
0L, 0L, 0L, 1L, 0L), TB = c(0.9, 2.4, 0.9, 0.9, 0.9, 0.7), DB = c(0.2,
1.1, 0.3, 0.2, 0.3, 0.2), Alkphos = c(116L, 554L, 310L, 224L,
293L, 235L), SGPT = c(36L, 141L, 21L, 36L, 232L, 96L), sgot = c(16L,
73L, 16L, 45L, 245L, 54L), TP = c(6.2, 7.5, 8.1, 6.9, 6.8, 9.5
), ALB = c(3.2, 3.6, 4.2, 4.2, 3.1, 4.9), AG = c(1, 0.9, 1, 1.55,
0.8, 1), Class = structure(c(2L, 1L, 2L, 1L, 1L, 1L), .Label = c("One",
"Two"), class = "factor")), .Names = c("age", "gender", "TB",
"DB", "Alkphos", "SGPT", "sgot", "TP", "ALB", "AG", "Class"), row.names = c(216L,
405L, 316L, 103L, 20L, 268L), class = "data.frame")
I would appreciate it if anyone could help me with this.
I wanted to see how tabu worked anyway so seemed a good place to start.
Basically, you need to test your code better; evaluate just did not work. It is easy to test it by hand by creating values of th and then calling evaluate on them.
Also use high level comments to organize your code and keep track of what you are doing, especially when posting to SO for help so as to save us time figuring out what you intend.
Not sure if these results are good, the amount of data is so minimal it is hard to tell.
Anyway here is the changed code:
NF <- 10
NTR <- 193
NTE <- 193
library(class)
library(e1071)
library(caret)
library(party)
library(nnet)
ILPD1 <- structure(
list(
age = c(55L,48L,14L,17L,40L,37L),
gender = c(0L,0L,0L,0L,1L,0L),
TB = c(0.9,2.4,0.9,0.9,0.9,0.7),
DB = c(0.2,1.1,0.3,0.2,0.3,0.2),
Alkphos = c(116L,554L,310L,224L,293L,235L),
SGPT = c(36L,141L,21L,36L,232L,96L),
sgot = c(16L,73L,16L,45L,245L,54L),
TP = c(6.2,7.5,8.1,6.9,6.8,9.5),
ALB = c(3.2,3.6,4.2,4.2,3.1,4.9),
AG = c(1,0.9,1,1.55,0.8,1),
Class = structure(c(2L,1L,2L,1L,1L,1L),
.Label = c("One","Two"),
class = "factor")
),
.Names = c("age","gender","TB","DB","Alkphos",
"SGPT","sgot","TP","ALB","AG","Class"),
row.names = c(216L,405L,316L,103L,20L,268L),
class = "data.frame"
)
ILPD <- ILPD1
#ILPD <- read.csv("ILPD.csv")
nrow(ILPD)
set.seed(9850)
# setup test and training data
gp <- runif(nrow(ILPD))
ILPD <- ILPD[order(gp),]
idx <- createDataPartition(y = ILPD$Class,p = 0.7,list = FALSE)
train <- ILPD[idx,]
test <- ILPD[ - idx,]
ver <- test[,11]
evaluate <- function(th) {
# evaluate the tabu for a value of th
# tabuSearch will use this function to evaluate points in its search space
#
# if everything is turned off just return zero as we are not interested
if(sum(th) == 0) return(0)
# we just want to train our svm on the columns for which th==1
svmtrn <- train[,th==1]
# but we need to have the Class variable as our label
if (is.null(svmtrn$Class)) return(0)
# Train up an svm now
# Note that the first argument is the formula we are training
model <- svm(Class~.,svmtrn,gamma = 0.1,kernel = "sigmoid",na.action = na.omit)
pred <- predict(model,test)
# now evaluate how well our prediction worked
csRate <- sum(pred == ver) / NTE
penalty <- (NF - sum(th)) / NF
return(csRate + penalty)
}
library(tabuSearch)
evaluate(matrix(1,1,NF))
res <- tabuSearch(size = NF,iters = 2,objFunc = evaluate,
config = matrix(1,1,NF),listSize = 5,nRestarts = 4)
plot(res)
plot(res,"tracePlot")
summary(res,verbose = TRUE)
Here are the output results:
[1] 6
[1] 0.005181347
Tabu Settings
Type = binary configuration
No of algorithm repeats = 1
No of iterations at each prelim search = 2
Total no of iterations = 12
No of unique best configurations = 8
Tabu list size = 5
Configuration length = 10
No of neighbours visited at each iteration = 10
Results:
Highest value of objective fn = 0.70518
Occurs # of times = 1
Optimum number of variables = 3
Optimum configuration:
[1] 1 0 0 0 0 1 0 0 0 1
And here is your plot:
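Finally, since the point about hand-testing evaluate matters, here is a minimal sketch of doing that (my own addition; it assumes the evaluate defined in the changed code above is already in the workspace):
set.seed(1)
for (k in 1:3) {
  th <- rbinom(NF, 1, 0.5)   # a random 0/1 configuration of length NF
  cat("config:", th, "-> objective:", evaluate(th), "\n")
}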

Error messages when running glmer in R

I am attempting to run two similar generalized linear mixed models in R. Both models have the same input variables for predictors, covariates and random factors, however, response variables differ. Models require the lme4 package. The issue I was having with the second model has been resolved by Ben Bolker.
In the first model, the response variable is biomass weight and family = gaussian.
global.model <- lmer(ex.drywght ~ forestloss562*forestloss17*roaddenssec*nearestroadprim +
elevation + soilPC1 + soilPC2 +
(1|block/fragment),
data = RespPredComb,
family = "gaussian")
Predictors have the following units:
forestloss562 = %,
forestloss17 = %,
roaddenssec = (km/km2) and
nearestroadprim = (m).
Executing this model brings up the following warning messages:
Warning messages:
1: In glmer(ex.drywght ~ forestloss562 * forestloss17 * roaddenssec * :
calling glmer() with family=gaussian (identity link) as a shortcut to lmer() is deprecated; please call lmer() directly
2: Some predictor variables are on very different scales: consider rescaling
I then perform these subsequent steps (following the sequence of steps described in Grueber et al. (2011)):
I standardize predictors,
stdz.model <- standardize(global.model, standardize.y = FALSE)
(requires package arm)
use automated model selection with subsets of the supplied ‘global’ model
model.set <- dredge(stdz.model)
(requires package MuMIn)
Here I get the following warning message:
Warning message:
In dredge(stdz.model2) : comparing models fitted by REML
find the top 2 AIC models and
top.models <- get.models(model.set, subset = delta < 2)
do model averaging
model.avg(model.set, subset = delta < 2)
Here, I get this error message:
Error in apply(apply(z, 2L, is.na), 2, all) :
dim(X) must have a positive length
Any advice on how to possibly fix this error would be very much appreciated.
In the second model, the response variable is richness, family is poisson.
global.model <- glmer(ex.richness ~ forestloss562*forestloss17*roaddenssec*nearestroadprim +
elevation + soilPC1 + soilPC2 +
(1|block/fragment),
data = mydata,
family = "poisson")
When I execute the above command I get the following error and warning messages:
Error: (maxstephalfit) PIRLS step-halvings failed to reduce deviance in pwrssUpdate
In addition: Warning messages:
1: Some predictor variables are on very different scales: consider rescaling
2: In pwrssUpdate(pp, resp, tolPwrss, GQmat, compDev, fac, verbose) :
Cholmod warning 'not positive definite' at file:../Cholesky/t_cholmod_rowfac.c, line 431
3: In pwrssUpdate(pp, resp, tolPwrss, GQmat, compDev, fac, verbose) :
Cholmod warning 'not positive definite' at file:../Cholesky/t_cholmod_rowfac.c, line 431
Please find a reproducible subset of my data below:
structure(list(plot.code = structure(c(1L, 3L, 2L, 4L, 5L, 6L,
7L), .Label = c("a100m56r", "b1m177r", "c100m56r", "d1f1r", "e1m177r",
"f1m17r", "lf10m56r"), class = "factor"), site.code = structure(c(1L,
3L, 2L, 4L, 5L, 6L, 7L), .Label = c("a100m56", "b1m177", "c100m56",
"d1f1", "e1m177", "f1m17", "lf10m56"), class = "factor"), block = structure(c(1L,
3L, 2L, 4L, 5L, 6L, 7L), .Label = c("a", "b", "c", "d", "e",
"f", "lf"), class = "factor"), fragment = structure(c(1L, 3L,
2L, 4L, 5L, 6L, 7L), .Label = c("a100", "b1", "c100", "d1", "e1",
"f1", "lf10"), class = "factor"), elevation = c(309L, 342L, 435L,
495L, 443L, 465L, 421L), forestloss562 = c(25.9, 56.77, 5.32,
27.4, 24.25, 3.09, 8.06), forestloss17 = c(7.47, 51.93, 79.76,
70.41, 80.55, 0, 0), roaddenssec = c(2.99, 3.92, 2.61, 1.58,
1.49, 1.12, 1.16), nearestroadprim = c(438L, 237L, 2637L, 327L,
655L, 528L, 2473L), soilPC1 = c(0.31, -0.08, 1.67, 2.39, -1.33,
-1.84, -0.25), soilPC2 = c(0.4, 0.41, -0.16, 0.15, 0.03, -0.73,
0.51), ex.richness = c(0L, 0L, 1L, 7L, 0L, 0L, 1L), ex.drywght = c(0,
0, 1.255, 200.2825, 0, 0, 0.04)), .Names = c("plot.code", "site.code",
"block", "fragment", "elevation", "forestloss562", "forestloss17",
"roaddenssec", "nearestroadprim", "soilPC1", "soilPC2", "ex.richness",
"ex.drywght"), class = "data.frame", row.names = c(NA, -7L))
tl;dr you need to standardize your variables before you fit the model, for greater numerical stability. I also have a few comments about the advisability of what you're doing, but I'll save them for the end ...
source("SO_glmer_26904580_data.R")
library("arm")
library("lme4")
library("MuMIn")
Try the first fit:
pmod <- glmer(ex.richness ~
forestloss562*forestloss17*roaddenssec*nearestroadprim +
elevation + soilPC1 + soilPC2 +
(1|block/fragment),
data = dat,
family = "poisson")
This fails, as reported above.
However, I get a warning you didn't report above:
## 1: Some predictor variables are on very different scales: consider rescaling
which provides a clue.
Scaling numeric parameters:
pvars <- c("forestloss562","forestloss17",
"roaddenssec","nearestroadprim",
"elevation","soilPC1","soilPC2")
datsc <- dat
datsc[pvars] <- lapply(datsc[pvars],scale)
Try again:
pmod <- glmer(ex.richness ~
forestloss562*forestloss17*roaddenssec*nearestroadprim +
elevation + soilPC1 + soilPC2 +
(1|block/fragment),
data = datsc,
family = "poisson",
na.action="na.fail")
This works, although we get a warning message about a too-large gradient -- I think this is actually ignorable (we're still working on getting these error sensitivity thresholds right).
As far as I can tell, the following lines seem to be working:
stdz.model <- standardize(pmod, standardize.y = FALSE)
## increases max gradient -- larger warning
model.set <- dredge(stdz.model) ## slow, but running ...
Here are my comments about advisability:
Not even counting random-effects parameters, you have only 8x as many observations as predictor variables. This is pushing it (a rule of thumb is that you should have 10-20 observations per parameter).
nrow(datsc) ## 159
ncol(getME(pmod,"X")) ## 19
Dredging/multi-model-averaging over models with and without interactions can be dangerous -- at the very least, centering continuous variables is necessary in order for it to be interpretable. (I don't know whether dredge does anything to try to be sensible in this case.)
I also tried glmmLasso on this problem -- it ended up shrinking away all of the fixed effect terms ...
library("glmmLasso")
datsc$bf <- interaction(datsc$block,datsc$fragment)
glmmLasso(ex.richness ~
forestloss562+forestloss17+roaddenssec+nearestroadprim +
elevation + soilPC1 + soilPC2,
rnd=list(block=~1,bf=~1),
data = datsc,
family = poisson(),
lambda=500)
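As a small hedged extension (not in the original answer): in practice glmmLasso is usually run over a grid of lambda values, keeping the one with the lowest BIC, e.g.:
lambdas <- seq(0, 500, by = 50)
bics <- sapply(lambdas, function(l) {
  fit <- try(glmmLasso(ex.richness ~ forestloss562 + forestloss17 + roaddenssec +
                         nearestroadprim + elevation + soilPC1 + soilPC2,
                       rnd = list(block = ~1, bf = ~1),
                       data = datsc, family = poisson(), lambda = l),
             silent = TRUE)              # some penalties may fail to converge
  if (inherits(fit, "try-error")) NA else fit$bic   # glmmLasso stores BIC in $bic (as in the package demos)
})
lambdas[which.min(bics)]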
