I am trying to implement Tabu Search on a classification dataset, namely the Indian Liver Patient Dataset available in the UCI repository at https://archive.ics.uci.edu/ml/datasets/ILPD+(Indian+Liver+Patient+Dataset), but I am facing issues.
Following is the code I've used:
NF <- 10
NTR <- 193
NTE <- 193
library(class)
library(e1071)
library(caret)
library(party)
library(nnet)
ILPD <- read.csv("C:/Users/Dell/Desktop/Codes and Datasets/ILPD.csv")
nrow(ILPD)
set.seed(9850)
gp<-runif(nrow(ILPD))
ILPD<-ILPD[order(gp),]
idx <- createDataPartition(y = ILPD$Class, p = 0.7, list = FALSE)
train<-ILPD[idx,]
test<-ILPD[-idx,]
ver<-test[,11]
evaluate <- function(th){
if (sum(th) == 0)return(0)
model <- svm(train[ ,th==1], train[,11] , gamma = 0.1, kernel ="sigmoid", na.action = na.omit)
pred <- predict(model, test[ ,th==1])
csRate <- sum(pred == ver)/NTE
penalty <- (NF - sum(th))/NF
return(csRate + penalty)
}
library(tabuSearch)
res <- tabuSearch(size = NF, iters = 2, objFunc = evaluate, config = matrix(1,1,NF), listSize = 5, nRestarts = 4)
plot(res)
plot(res, "tracePlot")
summary(res, verbose = TRUE)
Error:
Error in if (any(co)) { : missing value where TRUE/FALSE needed
In addition: Warning message:
In FUN(newX[, i], ...) : NAs introduced by coercion
Called from: svm.default(train[, th == 1], train[, 11], gamma = 0.1, kernel = "sigmoid", na.action = na.omit)
Here is some of the data:
structure(list(age = c(55L, 48L, 14L, 17L, 40L, 37L), gender = c(0L,
0L, 0L, 0L, 1L, 0L), TB = c(0.9, 2.4, 0.9, 0.9, 0.9, 0.7), DB = c(0.2,
1.1, 0.3, 0.2, 0.3, 0.2), Alkphos = c(116L, 554L, 310L, 224L,
293L, 235L), SGPT = c(36L, 141L, 21L, 36L, 232L, 96L), sgot = c(16L,
73L, 16L, 45L, 245L, 54L), TP = c(6.2, 7.5, 8.1, 6.9, 6.8, 9.5
), ALB = c(3.2, 3.6, 4.2, 4.2, 3.1, 4.9), AG = c(1, 0.9, 1, 1.55,
0.8, 1), Class = structure(c(2L, 1L, 2L, 1L, 1L, 1L), .Label = c("One",
"Two"), class = "factor")), .Names = c("age", "gender", "TB",
"DB", "Alkphos", "SGPT", "sgot", "TP", "ALB", "AG", "Class"), row.names = c(216L,
405L, 316L, 103L, 20L, 268L), class = "data.frame")
I would appreciate it if anyone could help me with this.
I wanted to see how tabu search worked anyway, so this seemed a good place to start.
Basically, you need to test your code better; evaluate just did not work. It is easy to test by hand by creating values of th and then calling evaluate on them.
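For example (a sketch of the kind of by-hand test I mean, using the fixed evaluate defined below):
# spot-check evaluate() on a few hand-made configurations
evaluate(rep(1, NF))               # all features on
evaluate(c(1, 1, rep(0, NF - 2)))  # a small subset of features
evaluate(rep(0, NF))               # degenerate case: should return 0 immediately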
Also, use high-level comments to organize your code and keep track of what you are doing, especially when posting to SO for help, so as to save us time figuring out what you intend.
I am not sure if these results are good; the amount of data is so minimal that it is hard to tell.
Anyway here is the changed code:
NF <- 10
NTR <- 193
NTE <- 193
library(class)
library(e1071)
library(caret)
library(party)
library(nnet)
ILPD1 <- structure(
list(
age = c(55L,48L,14L,17L,40L,37L),
gender = c(0L,0L,0L,0L,1L,0L),
TB = c(0.9,2.4,0.9,0.9,0.9,0.7),
DB = c(0.2,1.1,0.3,0.2,0.3,0.2),
Alkphos = c(116L,554L,310L,224L,293L,235L),
SGPT = c(36L,141L,21L,36L,232L,96L),
sgot = c(16L,73L,16L,45L,245L,54L),
TP = c(6.2,7.5,8.1,6.9,6.8,9.5),
ALB = c(3.2,3.6,4.2,4.2,3.1,4.9),
AG = c(1,0.9,1,1.55,0.8,1),
Class = structure(c(2L,1L,2L,1L,1L,1L),
.Label = c("One","Two"),
class = "factor")
),
.Names = c("age","gender","TB","DB","Alkphos",
"SGPT","sgot","TP","ALB","AG","Class"),
row.names = c(216L,405L,316L,103L,20L,268L),
class = "data.frame"
)
ILPD <- ILPD1
#ILPD <- read.csv("ILPD.csv")
nrow(ILPD)
set.seed(9850)
# setup test and training data
gp <- runif(nrow(ILPD))
ILPD <- ILPD[order(gp),]
idx <- createDataPartition(y = ILPD$Class,p = 0.7,list = FALSE)
train <- ILPD[idx,]
test <- ILPD[ - idx,]
ver <- test[,11]
evaluate <- function(th) {
# evaluate the tabu for a value of th
# tabuSearch will use this function to evaluate points in its search space
#
# if everything is turned off just return zero as we are not interested
if(sum(th) == 0) return(0)
# we just want to train our svm on the columns for which th==1
# (drop = FALSE keeps svmtrn a data frame even if only one column is selected)
svmtrn <- train[, th == 1, drop = FALSE]
# but we need to have the Class variable as our label
if (is.null(svmtrn$Class)) return(0)
# Train up an svm now
# Note that the first argument is the formula we are training
model <- svm(Class ~ ., svmtrn, gamma = 0.1, kernel = "sigmoid", na.action = na.omit)
pred <- predict(model, test)
# now evaluate how well our prediction worked
csRate <- sum(pred == ver) / NTE
penalty <- (NF - sum(th)) / NF
return(csRate + penalty)
}
library(tabuSearch)
evaluate(matrix(1,1,NF))
res <- tabuSearch(size = NF,iters = 2,objFunc = evaluate,
config = matrix(1,1,NF),listSize = 5,nRestarts = 4)
plot(res)
plot(res,"tracePlot")
summary(res,verbose = TRUE)
Here are the output results (the row count, then the value of evaluate on the all-ones configuration):
[1] 6
[1] 0.005181347
Tabu Settings
Type = binary configuration
No of algorithm repeats = 1
No of iterations at each prelim search = 2
Total no of iterations = 12
No of unique best configurations = 8
Tabu list size = 5
Configuration length = 10
No of neighbours visited at each iteration = 10
Results:
Highest value of objective fn = 0.70518
Occurs # of times = 1
Optimum number of variables = 3
Optimum configuration:
[1] 1 0 0 0 0 1 0 0 0 1
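If you want to map the optimum configuration back to feature names, something along these lines should work (a sketch; configKeep and eUtilityKeep are the fields of the object tabuSearch returns):
best <- which.max(res$eUtilityKeep)                 # index of the best configuration found
colnames(ILPD)[1:NF][res$configKeep[best, ] == 1]   # names of the selected feature columns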
And here is your plot:
I'm working in R on genomic data.
I'm trying to subset a very large melted phyloseq table, which includes a column of phylum IDs, in order to remove rows containing phyla that occur fewer than 100000 times in the table. I might have missed an "easy" way to do this, but I eventually ended up trying to make my own function.
The function:
phylum_subset <- function(x = melt.ALKSS_few, #melted physeq object
Count = melt.ALKSS$Phylum, #counting phyla
Value = 1000 #minimum number of OTUs
){
phyla.table <- table(x$Count)
for(Count in x){if(phyla.table[Count]<=100000)
subset(x,Phylum != Count)
}
}
I will grant that this is my first time writing a function and I don't really know what I'm doing.
My function input and resulting error output ends up like so:
melt.ALKSS_few.count <- phylum_subset(x = melt.ALKSS_few,Count = melt.ALKSS_few$Phylum,Value = 100000)
Error in if (phyla.table[Count] <= 1e+05) subset(x, Count != Phylum_col) :
the condition has length > 1
Because I'm trying to subset by a sum of occurrences in a column, across all occurrences in that column, I couldn't just use filter() or something once (unless I wanted to do that 500 times). Surely someone has done something like this before?
Edit: OK, trying to provide a reproducible chunk of my dataset. Be warned, it's got over 808k obs. of 47 variables, because doing genomics on an ecological dataset is a mess. I've removed some variables that are remnants of metadata from previous steps (primer sequences, etc.) that I won't be using in analysis, just to keep the code block... less massive.
> dput(droplevels(head(melt.ALKSS_few)))
structure(list(OTU = c("44c21e29adae97a53247abbd73978395", "0f18144d308ada95632ab5193d92073f",
"d829bee4984f82ffc2453212157caf96", "0f18144d308ada95632ab5193d92073f",
"0ddcd311e02f742e2e0e61ce02cf9c29", "120eba657e42a11a5c29f97b90f02035"
), Sample = c("S438", "S680", "S437", "S345", "S454", "S513"),
Abundance = c(10755, 9568, 8186, 7621, 7506, 7501), BarcodeSequence = c("CATTTTAGGACT",
"CGGAATAGAGTA", "CATTTTAGAGTA", "TATAATGGACCA", "CGGAATTGGCAT",
"GACGACGGACCA"), PrimerDesc = c("16S",
"16S", "16S", "16S", "16S", "16S"), SampleName = c("06222021KC-2-R",
"09292021KC-2-R", "06222021KC-1-R", "06032021KC-1-R", "06292921KC-3-R",
"06302021KC-3-R"), Project = c("16SLBSKR1-", "16SLBSKR2-",
"16SLBSKR1-", "16SLBSKR1-", "16SLBSKR1-", "16SLBSKR2-"),
Number = c("456", "694", "455", "363", "471", "491"), Date = c("6_22_2021", "9_29_2021", "6_22_2021", "6_3_2021",
"6_29_2021", "6_30_2021"), Year = c(2021L, 2021L, 2021L,
2021L, 2021L, 2021L), Season = c("Summer", "Fall", "Summer",
"Summer", "Summer", "Summer"), sample_Species = c("Little_Bluestem",
"Little_Bluestem", "Little_Bluestem", "Little_Bluestem",
"Little_Bluestem", "Little_Bluestem"), SoloOrMixed = c("Solo",
"Mixed", "Solo", "Mixed", "Mixed", "Solo"), Location = c("Tyler_SP", "Hy_180", "Tyler_SP", "Roadside_Hy67",
"Copper_Breaks_SP", "Caprock_Canyons_SP"), Ecoregion = c("South_Central_Plains",
"South_Central_Plains", "South_Central_Plains", "Edwards_Plateau",
"Southwestern_Tablelands", "Southwestern_Tablelands"), Habitat = c("Forest",
"Roadside", "Forest", "Roadside", "Roadside", "AridRock"),
Source = c("Root", "Root", "Root", "Root", "Root", "Root"
), PrecipMonth = c(96.65,
37.45, 96.65, 125.94, 125.01, 153.94), PrecipDaysSince = c(1L,
1L, 1L, 1L, 1L, 0L), pH = c(6.8, 6.7, 6.8, 8, 8, 7.8), EC = c(139L,
182L, 139L, 161L, 125L, 2370L), NO3 = c(0, 4.4, 0, 0.2, 2.2,
3.4), P = c(16L, 17L, 16L, 14L, 5L, 6L), K = c(145L, 84L,
145L, 114L, 160L, 65L), Ca = c(3918L, 2159L, 3918L, 27256L,
6609L, 16508L), Mg = c(166L, 130L, 166L, 188L, 148L, 95L),
S = c(10L, 16L, 10L, 24L, 24L, 14299L), Na = c(4L, 3L, 4L,
4L, 4L, 4L), Fe = c(19.76, 17, 19.76, 2.31, 1, 0), Zn = c(2.28,
15.1, 2.28, 7.01, 0.8, 0.1), Mn = c(64.16, 19, 64.16, 27.01,
15, 6), Cu = c(0.16, 0.2, 0.16, 0.16, 0.2, 0.2), Kingdom = c("d__Bacteria",
"d__Bacteria", "d__Bacteria", "d__Bacteria", "d__Bacteria",
"d__Bacteria"), Phylum = c("Proteobacteria", "Proteobacteria",
"Proteobacteria", "Proteobacteria", "Proteobacteria", "Actinobacteriota"
), Class = c("Gammaproteobacteria", "Gammaproteobacteria",
"Alphaproteobacteria", "Gammaproteobacteria", "Gammaproteobacteria",
"Actinobacteria"), Order = c("Xanthomonadales", "Pseudomonadales",
"Rhizobiales", "Pseudomonadales", "Pseudomonadales", "Streptomycetales"
), Family = c("Rhodanobacteraceae", "Pseudomonadaceae", "Xanthobacteraceae",
"Pseudomonadaceae", "Pseudomonadaceae", "Streptomycetaceae"
), Genus = c("Rhodanobacter", "Pseudomonas", "Bradyrhizobium",
"Pseudomonas", "Pseudomonas", "Streptomyces"), Species = c(NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_,
NA_character_)), row.names = c(2352002L, 511171L, 7348565L,
510815L, 468295L, 621043L), class = "data.frame")
See below for a tidyverse solution. Note that I've used GlobalPatterns from phyloseq to create a reproducible example.
require("phyloseq")
require("tidyverse")
# Load the data and melt it
data(GlobalPatterns)
psdf <- psmelt(GlobalPatterns)
# Function to subset a dataframe based on the size of each group
# in a grouping variable
subset_by_freq <- function(df, grouping_var, threshold){
df %>%
group_by(!!sym(grouping_var)) %>%
filter(n() >= threshold) %>%
ungroup()
}
# Filter out taxa with less than 1e5 counts
psdf_sub <- subset_by_freq(psdf, "Phylum", 1e5)
# Sanity check: count the number of rows per taxon
psdf_sub %>%
group_by(Phylum) %>%
tally()
#> # A tibble: 2 x 2
#> Phylum n
#> <chr> <int>
#> 1 Firmicutes 113256
#> 2 Proteobacteria 166816
Created on 2022-08-12 by the reprex package (v2.0.1)
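For comparison, a base-R equivalent of the same filtering (a sketch using the psdf object created above):
# keep rows whose Phylum occurs at least 1e5 times in the melted table;
# ave() returns each row's group size, and which() drops any NA groups
keep <- ave(seq_len(nrow(psdf)), psdf$Phylum, FUN = length) >= 1e5
psdf_sub2 <- psdf[which(keep), ]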
I have this data with one character column and one numeric column.
data = structure(list(Station = c("1A", "1B", "2A", "2B", "3A", "3B",
"4A", "4B", "5A", "5B", "6A", "6B", "7A", "7B"), Particles.kg = c(370L,
420L, 250L, 320L, 130L, 210L, 290L, 390L, 230L, 340L, 60L, 90L,
130L, 170L)), class = "data.frame", row.names = c(NA, -14L))
Now I convert the character column into a factor:
data$Station = as.factor(data$Station)
Then I start the hierarchical cluster analysis:
rownames(data) = c(data$Station)
data = data[,-1]
require(stats)
res.dist = dist(x=data, method = "euclidean")
hcl = hclust(d=res.dist, method = "complete")
plot(x=hcl, hang = -1, cex = 0.6)
(Sorry, I can't upload the picture due to a network issue.) After this, instead of 1A, 2A, 3A, 3B, ..., the labels in my plot come out as 1, 2, 3, ..., 14.
How can I solve this?
After dropping the 1st column, there is only one column left, which collapses the data into a vector. A vector drops the rownames of the dataframe, hence there are no labels.
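You can see the collapse with the data above (a quick illustration):
class(data[, -1])  # "integer": a bare vector, so the Station rownames are gone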
You can use drop = FALSE to keep the data as a dataframe after the subset.
rownames(data) = data$Station
data = data[,-1, drop = FALSE]
res.dist = dist(x=data, method = "euclidean")
hcl = hclust(d=res.dist, method = "complete")
plot(x=hcl, hang = -1, cex = 0.6)
There is no need to keep only the measurements column; the plot can be done by subsetting the data in the dist instruction.
Note that I use data[-1], not data[, -1], which would drop the dim attribute. The former always returns a sub-df.
rownames(data) <- data$Station
res.dist <- dist(x = data[-1], method = "euclidean")
hcl <- hclust(d = res.dist, method = "complete")
plot(x = hcl, hang = -1, cex = 0.6)
I am trying to run a multiple regression analysis to find the impact of water quality on plankton abundance in a specific location (aka Guzzler). I was able to get my model to run along with the summary; however, the data is non-parametric, so a typical summary would not be reliable. This is mainly due to having a small sample size, as sampling was done over the course of a few weeks with one sample each week.
I was then thinking the non-parametric version of this could be a bootstrap. I've run bootstraps on other data before, but never on a multiple regression model. I can't seem to find code on how to go about this, so I began with how I've performed bootstraps in the past. I am curious what I would need to edit in order to get this bootstrap to run.
Here is the output from dput(head(Guzzler1)):
structure(list(Abundance = c(98L, 43L, 65L, 55L, 54L), Phospates = c(2L,
2L, 2L, 2L, 2L), Nitrates = c(0, 0.3, 0, 0.15, 0), pH = c(7.5,
8, 7.5, 7, 7)), .Names = c("Abundance", "Phospates", "Nitrates",
"pH"), row.names = c(NA, 5L), class = "data.frame")
Here is my model & the summary:
Guzzler1model<-lm(Abundance ~ Phospates + Nitrates + pH, data=Guzzler1)
> summary(Guzzler1model)
Call:
lm(formula = Abundance ~ Phospates + Nitrates + pH, data = Guzzler1)
Residuals:
1 2 3 4 5
20.75 -4.25 -12.25 8.50 -12.75
Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) -80.25 209.62 -0.383 0.739
Phospates NA NA NA NA
Nitrates -135.00 90.02 -1.500 0.272
pH 21.00 28.87 0.727 0.543
Residual standard error: 20.41 on 2 degrees of freedom
Multiple R-squared: 0.5302, Adjusted R-squared: 0.06032
F-statistic: 1.128 on 2 and 2 DF, p-value: 0.4698
***Please note: I believe phosphates have NA because each value was equal to 2 in this particular location.
Here is how I was originally performing a bootstrap; I am unsure what to change:
n=length(Guzzler1Abundance)
B = 1000
results = numeric(B)
for(b in 1:B){
i = sample(x = 1:n, size=n, replace=TRUE)
bootSample = Guzzler1Abundance[i]
thetahat= mean(bootSample)
results[b] = thetahat
}
Thank you so much in advance!
I am not quite sure what you mean by non-parametric data, but I understand that you want to take bootstrapped samples from your data and perform linear regression on them.
A possible way to do that would be:
Guzzler1 <- structure(list(Abundance = c(98L, 43L, 65L, 55L, 54L), Phospates = c(2L,
2L, 2L, 2L, 2L), Nitrates = c(0, 0.3, 0, 0.15, 0), pH = c(7.5,
8, 7.5, 7, 7)), .Names = c("Abundance", "Phospates", "Nitrates",
"pH"), row.names = c(NA, 5L), class = "data.frame")
lines <- nrow(Guzzler1)
replicate(5, lm(Abundance ~ Phospates + Nitrates + pH,
data=Guzzler1[sample(lines, replace = TRUE),])$coefficients)
This will report the coefficients from 5 linear regressions, like this:
> replicate(5, lm(Abundance ~ Phospates + Nitrates + pH,
+ data=Guzzler1[sample(lines, replace = TRUE),])$coefficients)
[,1] [,2] [,3] [,4] [,5]
(Intercept) 65.00000 145.000000 -408.0000 145.000000 -100
Phospates NA NA NA NA NA
Nitrates -73.33333 6.666667 -256.6667 6.666667 -110
pH NA -13.000000 66.0000 -13.000000 22
The number of replicates (5 here) can be made arbitrarily higher by changing the first argument of my replicate call. The many NA values are due to the scarce data, as @IRTFM predicted and explained. This will improve as more data is sampled.
Let's sample 5000 bootstrap samples and investigate the distribution of the Nitrate coefficients:
reps <- replicate(5000, lm(Abundance ~ Phospates + Nitrates + pH,
data=Guzzler1[sample(lines, replace = TRUE),])$coefficients)
plot(table(reps["Nitrates",]))
plot(ecdf(reps["Nitrates",]))
quantile(reps["Nitrates",], c(.025, .25, .5, .75, .975), na.rm = TRUE)
Phosphates can be added to this once there are variant data:
boxplot(reps["(Intercept)",], reps["Nitrates",], reps["pH",]
, names = c("Intercept", "Nitrates", "pH"), ylab="bootstrapped coefficients")
abline(h=0, col="firebrick", lty=3)
I use the toy dataset (class membership variable & 2 features) below to apply a Gaussian Naive Bayes model and plot the contours of the class-specific bivariate normal distributions.
How to add a line for the decision boundary to the plot below?
Like here:
(Image source: https://alliance.seas.upenn.edu/~cis520/dynamic/2016/wiki/uploads/Lectures/2class_gauss_NB.jpg)
# Packages
library(klaR)
library(MASS)
# Data
d <- structure(list(y = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 1L), .Label = c("0", "1"), class = "factor"), x1 = c(2, 2.8, 1.5, 2.1, 5.5, 8, 6.9, 8.5, 2.5, 7.7), x2 = c(1.5, 1.2, 1, 1, 4, 4.8, 4.5, 5.5, 2, 3.5)), .Names = c("y", "x1", "x2"), row.names = c(NA, -10L), class = "data.frame")
# Naive Bayes Model
mN <- NaiveBayes(y ~ x1+x2, data = d)
# Data
# Class 1
m1 <- mean(d[which(d$y==1),]$x1)
m2 <- mean(d[which(d$y==1),]$x2)
mu1_2 <- c(m1,m2) # Mean
sd1 <- sd(d[which(d$y==1),]$x1)
sd2 <- sd(d[which(d$y==1),]$x2)
Sigma1_2 <- matrix(c(sd1, 0, 0, sd2), 2) # Covariance matrix
bivn1_2 <- mvrnorm(5000, mu = mu1_2, Sigma = Sigma1_2 ) # from Mass package: Simulate bivariate normal PDF
bivn1_2.kde <- kde2d(bivn1_2[,1], bivn1_2[,2], n = 50) # from MASS package: Calculate kernel density estimate
# Class 0
m3 <- mean(d[which(d$y==0),]$x1)
m4 <- mean(d[which(d$y==0),]$x2)
mu3_4 <- c(m3,m4) # Mean
sd3 <- sd(d[which(d$y==0),]$x1)
sd4 <- sd(d[which(d$y==0),]$x2)
Sigma3_4 <- matrix(c(sd3, 0, 0, sd4), 2) # Covariance matrix
bivn3_4 <- mvrnorm(5000, mu = mu3_4, Sigma = Sigma3_4 ) # from Mass package: Simulate bivariate normal PDF
bivn3_4.kde <- kde2d(bivn3_4[,1], bivn3_4[,2], n = 50) # from MASS package: Calculate kernel density estimate
# Plot
plot(x= d$x1, y=d$x2, xlim=c(-1,10), ylim=c(-1,10), col=d$y, pch=19, cex=2, ylab="x2", xlab="x1")
contour(bivn1_2.kde, add = TRUE, col="darkgrey") # from base graphics package
contour(bivn3_4.kde, add = TRUE, col="darkgrey") # from base graphics package
text(labels = "Class 1",x = 8, y=7, col="grey")
text(labels = "Class 0",x = 0, y=4, col="grey")
I am attempting to run two similar generalized linear mixed models in R. Both models have the same input variables for predictors, covariates, and random factors; however, the response variables differ. The models require the lme4 package. The issue I was having with the second model has been resolved by Ben Bolker.
In the first model, the response variable is biomass weight and family = gaussian.
global.model <- lmer(ex.drywght ~ forestloss562*forestloss17*roaddenssec*nearestroadprim +
elevation + soilPC1 + soilPC2 +
(1|block/fragment),
data = RespPredComb,
family = "gaussian")
Predictors have the following units:
forestloss562: %
forestloss17: %
roaddenssec: km/km2
nearestroadprim: m
Executing this model brings up the following warning messages:
Warning messages:
1: In glmer(ex.drywght ~ forestloss562 * forestloss17 * roaddenssec * :
calling glmer() with family=gaussian (identity link) as a shortcut to lmer() is deprecated; please call lmer() directly
2: Some predictor variables are on very different scales: consider rescaling
I then perform these subsequent steps, following the sequence described in Grueber et al. (2011):
1. Standardize predictors (requires package arm):
stdz.model <- standardize(global.model, standardize.y = FALSE)
2. Use automated model selection with subsets of the supplied 'global' model (requires package MuMIn):
model.set <- dredge(stdz.model)
Here I get the following warning message:
Warning message:
In dredge(stdz.model2) : comparing models fitted by REML
3. Find the top 2 AIC models:
top.models <- get.models(model.set, subset = delta < 2)
4. Do model averaging:
model.avg(model.set, subset = delta < 2)
Here, I get this error message:
Error in apply(apply(z, 2L, is.na), 2, all) :
dim(X) must have a positive length
Any advice on how to possibly fix this error would be very much appreciated.
In the second model, the response variable is richness and the family is Poisson.
global.model <- glmer(ex.richness ~ forestloss562*forestloss17*roaddenssec*nearestroadprim +
elevation + soilPC1 + soilPC2 +
(1|block/fragment),
data = mydata,
family = "poisson")
When I execute the above command I get the following error and warning messages:
Error: (maxstephalfit) PIRLS step-halvings failed to reduce deviance in pwrssUpdate
In addition: Warning messages:
1: Some predictor variables are on very different scales: consider rescaling
2: In pwrssUpdate(pp, resp, tolPwrss, GQmat, compDev, fac, verbose) :
Cholmod warning 'not positive definite' at file:../Cholesky/t_cholmod_rowfac.c, line 431
3: In pwrssUpdate(pp, resp, tolPwrss, GQmat, compDev, fac, verbose) :
Cholmod warning 'not positive definite' at file:../Cholesky/t_cholmod_rowfac.c, line 431
Please find a reproducible subset of my data below:
structure(list(plot.code = structure(c(1L, 3L, 2L, 4L, 5L, 6L,
7L), .Label = c("a100m56r", "b1m177r", "c100m56r", "d1f1r", "e1m177r",
"f1m17r", "lf10m56r"), class = "factor"), site.code = structure(c(1L,
3L, 2L, 4L, 5L, 6L, 7L), .Label = c("a100m56", "b1m177", "c100m56",
"d1f1", "e1m177", "f1m17", "lf10m56"), class = "factor"), block = structure(c(1L,
3L, 2L, 4L, 5L, 6L, 7L), .Label = c("a", "b", "c", "d", "e",
"f", "lf"), class = "factor"), fragment = structure(c(1L, 3L,
2L, 4L, 5L, 6L, 7L), .Label = c("a100", "b1", "c100", "d1", "e1",
"f1", "lf10"), class = "factor"), elevation = c(309L, 342L, 435L,
495L, 443L, 465L, 421L), forestloss562 = c(25.9, 56.77, 5.32,
27.4, 24.25, 3.09, 8.06), forestloss17 = c(7.47, 51.93, 79.76,
70.41, 80.55, 0, 0), roaddenssec = c(2.99, 3.92, 2.61, 1.58,
1.49, 1.12, 1.16), nearestroadprim = c(438L, 237L, 2637L, 327L,
655L, 528L, 2473L), soilPC1 = c(0.31, -0.08, 1.67, 2.39, -1.33,
-1.84, -0.25), soilPC2 = c(0.4, 0.41, -0.16, 0.15, 0.03, -0.73,
0.51), ex.richness = c(0L, 0L, 1L, 7L, 0L, 0L, 1L), ex.drywght = c(0,
0, 1.255, 200.2825, 0, 0, 0.04)), .Names = c("plot.code", "site.code",
"block", "fragment", "elevation", "forestloss562", "forestloss17",
"roaddenssec", "nearestroadprim", "soilPC1", "soilPC2", "ex.richness",
"ex.drywght"), class = "data.frame", row.names = c(NA, -7L))
tl;dr you need to standardize your variables before you fit the model, for greater numerical stability. I also have a few comments about the advisability of what you're doing, but I'll save them for the end ...
source("SO_glmer_26904580_data.R")
library("arm")
library("lme4")
library("MuMIn")
Try the first fit:
pmod <- glmer(ex.richness ~
forestloss562*forestloss17*roaddenssec*nearestroadprim +
elevation + soilPC1 + soilPC2 +
(1|block/fragment),
data = dat,
family = "poisson")
This fails, as reported above.
However, I get a warning you didn't report above:
## 1: Some predictor variables are on very different scales: consider rescaling
which provides a clue.
Scaling numeric parameters:
pvars <- c("forestloss562","forestloss17",
"roaddenssec","nearestroadprim",
"elevation","soilPC1","soilPC2")
datsc <- dat
datsc[pvars] <- lapply(datsc[pvars],scale)
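As a quick sanity check (a small addition, not in the original answer), the rescaled columns should now have mean approximately 0 and standard deviation 1:
sapply(datsc[pvars], mean)  # all approximately 0
sapply(datsc[pvars], sd)    # all equal to 1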
Try again:
pmod <- glmer(ex.richness ~
forestloss562*forestloss17*roaddenssec*nearestroadprim +
elevation + soilPC1 + soilPC2 +
(1|block/fragment),
data = datsc,
family = "poisson",
na.action="na.fail")
This works, although we get a warning message about a too-large gradient -- I think this is actually ignorable (we're still working on getting these error sensitivity thresholds right).
As far as I can tell, the following lines seem to be working:
stdz.model <- standardize(pmod, standardize.y = FALSE)
## increases max gradient -- larger warning
model.set <- dredge(stdz.model) ## slow, but running ...
Here are my comments about advisability:
Not even counting random-effects parameters, you have only 8x as many observations as predictor variables. This is pushing it (a rule of thumb is that you should have 10-20 observations per parameter).
nrow(datsc) ## 159
ncol(getME(pmod,"X")) ## 19
Dredging/multi-model-averaging over models with and without interactions can be dangerous -- at the very least, centering continuous variables is necessary in order for it to be interpretable. (I don't know whether dredge does anything to try to be sensible in this case.)
I also tried glmmLasso on this problem -- it ended up shrinking away all of the fixed effect terms ...
library("glmmLasso")
datsc$bf <- interaction(datsc$block,datsc$fragment)
glmmLasso(ex.richness ~
forestloss562+forestloss17+roaddenssec+nearestroadprim +
elevation + soilPC1 + soilPC2,
rnd=list(block=~1,bf=~1),
data = datsc,
family = poisson(),
lambda=500)