clmm model summary does not show p-values - r

I ran the following model in R:
clmm_br <- clmm(Grado_amenaza ~ Life_Form + size_max_cm +
                  leaf_length_mean + petals_length_mean +
                  silicua_length_mean + bloom_length + categ_color + (1|Genero),
                data = brasic1)
I didn't get any warnings or errors, but when I run summary(clmm_br) I don't get any p-values:
summary(clmm_br)
Cumulative Link Mixed Model fitted with the Laplace approximation
formula: Grado_amenaza ~ Life_Form + size_max_cm + leaf_length_mean +
petals_length_mean + silicua_length_mean + bloom_length +
categ_color + (1 | Genero)
data: brasic1
link threshold nobs logLik AIC niter max.grad cond.H
logit flexible 76 -64.18 160.36 1807(1458) 1.50e-03 NaN
Random effects:
Groups Name Variance Std.Dev.
Genero (Intercept) 0.000000008505 0.00009222
Number of groups: Genero 39
Coefficients:
Estimate Std. Error z value Pr(>|z|)
Life_Form[T.G] 2.233338 NA NA NA
Life_Form[T.Hem] 0.577112 NA NA NA
Life_Form[T.Hyd] -22.632916 NA NA NA
Life_Form[T.Th] -1.227512 NA NA NA
size_max_cm 0.006442 NA NA NA
leaf_length_mean 0.008491 NA NA NA
petals_length_mean 0.091623 NA NA NA
silicua_length_mean -0.036001 NA NA NA
bloom_length -0.844697 NA NA NA
categ_color[T.2] -2.420793 NA NA NA
categ_color[T.3] 1.268585 NA NA NA
categ_color[T.4] 1.049953 NA NA NA
Threshold coefficients:
Estimate Std. Error z value
1|3 -1.171 NA NA
3|4 1.266 NA NA
4|5 1.800 NA NA
(4 observations deleted due to missingness)
I tried fitting without random effects and excluding the rows with NAs, but the result is the same.
The structure of my data:
str(brasic1)
tibble[,13] [80 x 13] (S3: tbl_df/tbl/data.frame)
$ ID : num [1:80] 135 137 142 145 287 295 585 593 646 656 ...
$ Genero : chr [1:80] "Alyssum" "Alyssum" "Alyssum" "Alyssum" ...
$ Cons.stat : chr [1:80] "LC" "VU" "VU" "LC" ...
$ Amenazada : num [1:80] 0 1 1 0 1 0 0 1 0 0 ...
$ Grado_amenaza : Factor w/ 5 levels "1","3","4","5",..: 1 2 2 1 4 1 1 2 1 1 ...
$ Life_Form : chr [1:80] "Th" "Hem" "Hem" "Th" ...
$ size_max_cm : num [1:80] 12 6 7 15 20 27 60 62 50 60 ...
$ leaf_length_mean : num [1:80] 7.5 7 11 14.5 31.5 45 90 65 65 39 ...
$ petals_length_mean : num [1:80] 2.2 3.5 5.5 2.55 6 8 10.5 9.5 9.5 2.9 ...
$ silicua_length_mean: num [1:80] 3.5 4 3.5 4 25 47.5 37.5 41.5 17.5 2.9 ...
$ X2n : num [1:80] 32 NA 16 16 NA NA 20 20 18 14 ...
$ bloom_length : num [1:80] 2 1 2 2 2 2 2 2 11 2 ...
$ categ_color : chr [1:80] "1" "4" "4" "4" ...

For a full answer we really need a reproducible example, but I can point to a few things that raise suspicions.
The fact that you can get estimates, but not standard errors, implies that there is something wrong with the Hessian (the estimate of the curvature of the log-likelihood surface at the maximum likelihood estimate), but there are several distinct (possibly overlapping) possibilities:
Any time you have a "large" parameter estimate (say, absolute value > 10), as in your example (Life_Form[T.Hyd] = -22.632916), it suggests complete separation, i.e. the presence/absence of that category perfectly predicts the response. (You can search for that term, e.g. on CrossValidated.) However, complete separation usually leads to absurdly large standard errors (along with the large parameter estimates) rather than to NAs.
You may have perfect multicollinearity, i.e. combinations of your predictor variables that are perfectly (jointly) correlated with other such combinations. Some R estimation procedures can detect and deal with this case (typically by dropping one or more predictors), but clmm might not be able to. (You should be able to construct your model matrix, X <- model.matrix(your_formula, your_data), excluding the random effect from the formula, and then use caret::findLinearCombos(X) to explore this issue.)
More generally, if you want to do reliable inference you may need to cut down the size of your model (not by stepwise or other forms of model selection); a rule of thumb is that you need 10-20 observations per parameter estimated. You're trying to estimate 12 fixed-effect parameters plus a few more (ordinal threshold parameters and a random-effect variance) from 80 observations ...
In addition to dropping the random effects, it may be useful for diagnosis to fit a regular linear model with lm() (which should tell you something about collinearity, by dropping parameters) or a binomial model based on some threshold grade values (which might help with identifying complete separation).
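For example, here is a minimal sketch of the collinearity check and a fixed-effects-only refit, assuming the same brasic1 data and the formula from the question (clm() from the ordinal package fits the model without the random effect):
library(ordinal)  # clm/clmm
library(caret)    # findLinearCombos
## fixed-effects-only model matrix (random effect dropped from the formula)
X <- model.matrix(~ Life_Form + size_max_cm + leaf_length_mean +
                    petals_length_mean + silicua_length_mean +
                    bloom_length + categ_color,
                  data = brasic1)
caret::findLinearCombos(X)  # lists columns that are linear combinations of other columns
## fixed-effects-only fit; NA standard errors here would point away from the random effect
clm_br <- clm(Grado_amenaza ~ Life_Form + size_max_cm + leaf_length_mean +
                petals_length_mean + silicua_length_mean + bloom_length + categ_color,
              data = brasic1)
summary(clm_br)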

Related

Error in MEEM(object, conLin, control$niterEM) in lme function

I'm trying to apply the lme function to my data, but the model gives the following error message:
mod.1 = lme(lon ~ sex + month2 + bat + sex*month2, random=~1|id, method="ML", data = AA_patch_GLM, na.action=na.exclude)
Error in MEEM(object, conLin, control$niterEM) :
Singularity in backsolve at level 0, block 1
dput for data, copy from https://pastebin.com/tv3NvChR (too large to include here)
str(AA_patch_GLM)
'data.frame': 2005 obs. of 12 variables:
$ lon : num -25.3 -25.4 -25.4 -25.4 -25.4 ...
$ lat : num -51.9 -51.9 -52 -52 -52 ...
$ id : Factor w/ 12 levels "24641.05","24642.03",..: 1 1 1 1 1 1 1 1 1 1 ...
$ sex : Factor w/ 2 levels "F","M": 1 1 1 1 1 1 1 1 1 1 ...
$ bat : int -3442 -3364 -3462 -3216 -3216 -2643 -2812 -2307 -2131 -2131 ...
$ year : chr "2005" "2005" "2005" "2005" ...
$ month : chr "12" "12" "12" "12" ...
$ patch_id: Factor w/ 45 levels "111870.17_1",..: 34 34 34 34 34 34 34 34 34 34 ...
$ YMD : Date, format: "2005-12-30" "2005-12-31" "2005-12-31" ...
$ month2 : Ord.factor w/ 7 levels "January"<"February"<..: 7 7 7 7 7 1 1 1 1 1 ...
$ lonsc : num [1:2005, 1] -0.209 -0.213 -0.215 -0.219 -0.222 ...
$ batsc : num [1:2005, 1] 0.131 0.179 0.118 0.271 0.271 ...
What's the problem?
I saw a solution using the lme4::lmer function, but is there another option that lets me keep using the lme function?
The problem is that you have collinear combinations of predictors. In particular, here are some diagnostics:
## construct the fixed-effect model matrix for your problem
X <- model.matrix(~ sex + month2 + bat + sex*month2, data = AA_patch_GLM)
lc <- caret::findLinearCombos(X)
colnames(X)[lc$linearCombos[[1]]]
## [1] "sexM:month2^6" "(Intercept)" "sexM" "month2.L"
## [5] "month2.C" "month2^4" "month2^5" "month2^6"
## [9] "sexM:month2.L" "sexM:month2.C" "sexM:month2^4" "sexM:month2^5"
This is in a weird order, but it suggests that the sex × month interaction is causing problems. Indeed:
with(AA_patch_GLM, table(sex, month2))
## sex January February March April May June December
## F 367 276 317 204 43 0 6
## M 131 93 90 120 124 75 159
shows that you're missing data for one sex/month combination (i.e., no females were sampled in June).
You can:
construct the sex/month interaction yourself (data$SM <- with(data, interaction(sex, month2, drop = TRUE))) and use ~ SM + bat, but then you'll have to sort out main effects and interactions yourself (ugh); see the sketch after these options
construct the model matrix by hand (as above), drop the redundant column(s), then include all the resulting columns in the model:
d2 <- with(AA_patch_GLM,
data.frame(lon,
as.data.frame(X),
id))
## drop linearly dependent column
## note data.frame() has "sanitized" variable names (:, ^ both converted to .)
d2 <- d2[names(d2) != "sexM.month2.6"]
lme(reformulate(colnames(d2)[2:15], response = "lon"),
random=~1|id, method="ML", data = d2)
Again, the results will be uglier than the simpler version of the model.
use a patched version of nlme (I submitted a patch here but it hasn't been considered)
remotes::install_github("bbolker/nlme")
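Here is the sketch of the first option above (assuming the same AA_patch_GLM data; SM is a new column name introduced here for illustration):
library(nlme)
## build the sex/month interaction by hand; drop = TRUE removes the empty
## female/June cell, so the fixed-effect design stays full rank
AA_patch_GLM$SM <- with(AA_patch_GLM, interaction(sex, month2, drop = TRUE))
mod.sm <- lme(lon ~ SM + bat, random = ~ 1 | id, method = "ML",
              data = AA_patch_GLM, na.action = na.exclude)
summary(mod.sm)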

Car Package in R: powerTransform() - Error non-finite finite-difference value [1]

Issue:
I have a data frame (called Yeo) containing seven parameters with continuous values (columns 5-11; see parameters below), and I conducted a Shapiro-Wilk test to determine whether or not the univariate samples came from a normal distribution. For each parameter, the residuals were non-normal and skewed, so I want to transform my variables using both the yjPower (Yeo-Johnson transformation) and bcPower (Box-Cox transformation) families to compare the two transformations.
I have used the R code below on many occasions before, so I know it works. However, for this data frame, I keep getting the error shown below. Unfortunately, I cannot provide a reproducible example online as the data belongs to three different organisations. I have opened an old data frame with the same parameters and my R code runs absolutely fine. I really can't figure out a solution.
Would anybody be able to please help me understand this error message below?
Many thanks if you can advise.
Error
transform=powerTransform(as.matrix(Yeo[5:11]), family= "yjPower")
Error in optim(start, llik, hessian = TRUE, method = method, ...) :
non-finite finite-difference value [1]
# save transformed data in stand_trans to compare both transformations
stand_trans=Yeo
stand_trans[,5]=yjPower(Yeo[,5],transform$lambda[1])
stand_trans[,6]=yjPower(Yeo[,6],transform$lambda[2])
stand_trans[,7]=yjPower(Yeo[,7],transform$lambda[3])
stand_trans[,8]=yjPower(Yeo[,8],transform$lambda[4])
stand_trans[,9]=yjPower(Yeo[,9],transform$lambda[5])
stand_trans[,10]=yjPower(Yeo[,10],transform$lambda[6])
stand_trans[,11]=yjPower(Yeo[,11],transform$lambda[7])
Parameters
'data.frame': 888 obs. of 14 variables:
$ ID : num 1 2 3 4 5 6 7 8 9 10 ...
$ Year : num 2020 2020 2020 2020 2020 2020 2020 2020 2020 2020 ...
$ Date : Factor w/ 19 levels "","01.09.2019",..: 19 19 19 19 19 19 19 17 17 17 ...
$ Country : Factor w/ 3 levels "","France","Argentina": 3 3 3 3 3 3 3 3 3 3 ...
$ Low.Freq : num 4209 8607 9361 9047 7979 ...
$ High.Freq : num 15770 18220 19853 18220 17843 ...
$ Start.Freq : num 4436 13945 16264 12283 12691 ...
$ End.Freq : num 4436 13945 16264 12283 12691 ...
$ Peak.Freq : num 4594 8906 11531 10781 8812 ...
$ Center.Freq : num 1.137 0.754 0.785 0.691 0.883 ...
$ Delta.Freq : num 11560 9613 10492 9173 9864 ...
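One way to narrow this down (a sketch, assuming the same Yeo data frame) is to run powerTransform() on each column separately and see which variable makes the optimiser fail:
library(car)
## try each column on its own; the one(s) that fail point to the problem variable
for (j in 5:11) {
  res <- try(powerTransform(Yeo[[j]], family = "yjPower"), silent = TRUE)
  cat(names(Yeo)[j], ": ",
      if (inherits(res, "try-error")) "failed" else "ok", "\n", sep = "")
}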

Predicting species presence with a random forest model based on unbalanced training data

I want to build a species distribution model using random-forest:
My training data consists of 971 records of species presence (71)/absence (900) and three environmental variables at systematically sampled points (4*4m, random starting point).
Training data:
str(train)
'data.frame': 971 obs. of 4 variables:
$ presence: num 0 0 0 0 0 0 0 0 0 0 ...
$ v1 : num 0.18 0.18 0.24 0.24 0.75 0.7 0.27 0 0.29 0.77 ...
$ v2 : num 10 110 19 99 97 71 64 45 54 74 ...
$ v3 : Factor w/ 3 levels "cat1","cat2",..: 1 1 1 1 2 2 2 3 1 2 ...
model:
model <- randomForest(as.factor(presence) ~ v1 + v2 + v3, data = train)
My test data (test) consists of 1019 records of the same variables, including their coordinates, at location B. Additionally, I have mapped the species occurrence at location B, so I applied the model to that data:
prediction <- predict(model, newdata = test, type="prob")
I set type = "prob" because I wanted to predict the occurrence probability of the species.
The data I generated and want to test against the observed occurrence looks like this:
str(prediction_data)
'data.frame': 1019 obs. of 16 variables:
$ x : num 180574 180575 180576 180576 180576 ...
$ y : num 226954 226953 226951 226953 226955 ...
$ v1 : num 0.1131 0.5996 0.7187 0.5885 0.0611 ...
$ v2 : num 10 110 19 99 97 71 64 45 54 74 ...
$ v3 : int 1 1 1 1 1 1 1 1 2 1 ...
$ occurrence_prob : num 0.3252 0.1826 0.0909 0.1014 0.4195 ...
Now my doubt is whether it makes sense to account for the unbalanced training data and try to improve the sensitivity of the prediction by using parameters such as sampsize = c(71, 71) or classwt = c(0.5, 0.5) in the model-building function, given that I finally also want to set a probability threshold for classifying species presence by analyzing the receiver operating characteristic (ROC) curve.
Would this improve the model sensitivity, be redundant or make things worse?
I would really appreciate any thoughts, advice, opinions, or hints. Unfortunately I do not know anyone with whom I could discuss my doubts in person. Thank you!
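For concreteness, here is a minimal sketch of the two options mentioned above, assuming the same train and test data (71 is the number of presence records; the order of sampsize/classwt follows the factor levels 0, 1):
library(randomForest)
## balanced bootstrap samples: draw 71 absences and 71 presences for each tree
model_samp <- randomForest(as.factor(presence) ~ v1 + v2 + v3, data = train,
                           strata = as.factor(train$presence),
                           sampsize = c(71, 71))
## class weights instead of resampling
model_wt <- randomForest(as.factor(presence) ~ v1 + v2 + v3, data = train,
                         classwt = c(0.5, 0.5))
## probabilities for location B, as before, to feed into an ROC/threshold analysis
pred_samp <- predict(model_samp, newdata = test, type = "prob")[, "1"]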

Kaplan Meier Survival curve results differ between R and SAS?

I'm re-running Kaplan-Meier survival curves from previously published data, using the exact data set used in the publication (Charpentier et al. 2008 - Inbreeding depression in ring-tailed lemurs (Lemur catta): genetic diversity predicts parasitism, immunocompetence, and survivorship). This publication ran the curves in SAS Version 9, using PROC LIFETEST, to analyze the age at death structured by genetic heterozygosity and sex of the animal (n = 64). She reports a chi-square value of 6.31 and a p-value of 0.012; however, when I run the curves in R, I get a chi-square value of 0.9 and a p-value of 0.821. Can anyone explain this?
R code used: age is the time to death, mort1 is the censorship code, sex is the stratum, and ho2 is the factor delineating the two groups to be compared.
> survdiff(Surv(age, mort1)~ho2+sex,data=mariekmsurv1)
Call:
survdiff(formula = Surv(age, mort1) ~ ho2 + sex, data = mariekmsurv1)
N Observed Expected (O-E)^2/E (O-E)^2/V
ho2=1, sex=F 18 3 3.23 0.0166 0.0215
ho2=1, sex=M 12 3 2.35 0.1776 0.2140
ho2=2, sex=F 17 5 3.92 0.3004 0.4189
ho2=2, sex=M 17 4 5.50 0.4088 0.6621
Chisq= 0.9 on 3 degrees of freedom, p= 0.821
> str(mariekmsurv1)
'data.frame': 64 obs. of 6 variables:
$ id : Factor w/ 65 levels "","aeschylus",..: 14 31 33 30 47 57 51 39 36 3 ...
$ sex : Factor w/ 3 levels "","F","M": 3 2 3 2 2 2 2 2 2 2 ...
$ mort1: int 0 0 0 0 0 0 0 0 0 0 ...
$ age : num 0.12 0.192 0.2 0.23 1.024 ...
$ sex.1: Factor w/ 3 levels "","F","M": 3 2 3 2 2 2 2 2 2 2 ...
$ ho2 : int 1 1 1 2 1 1 1 1 1 2 ...
- attr(*, "na.action")=Class 'omit' Named int [1:141] 65 66 67 68 69 70 71 72 73 74 ...
.. ..- attr(*, "names")= chr [1:141] "65" "66" "67" "68" ...
Some ideas:
Try running it in SAS -- see if you get the same results as the author. Maybe they didn't send you the exact same dataset they used.
Look into the default values of the relevant SAS PROC and compare to the defaults of the R function you are using.
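As an example of the second point (a sketch, assuming the same data): survdiff()'s default rho = 0 gives the log-rank test, which is what PROC LIFETEST's STRATA statement reports by default (along with a Wilcoxon test), while rho = 1 gives the Peto & Peto modification of the Gehan-Wilcoxon test:
library(survival)
survdiff(Surv(age, mort1) ~ ho2 + sex, data = mariekmsurv1, rho = 0)  # log-rank
survdiff(Surv(age, mort1) ~ ho2 + sex, data = mariekmsurv1, rho = 1)  # Peto & Peto (Wilcoxon-type)
## if the original analysis stratified by sex rather than crossing it with ho2
## (an assumption; the publication is not explicit here), the R equivalent would be:
survdiff(Surv(age, mort1) ~ ho2 + strata(sex), data = mariekmsurv1)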
Given the huge difference in the chi-squared values (6.31 vs. 0.9) and p-values (0.012 vs. 0.821) between the SAS and R procedures, I suspect that you have used the wrong variables in one of the procedures.
Procedural or data-handling differences between SAS and R would only cause very small discrepancies.
This is not a software error; it is most likely human error.

How do I extract lmer fixed effects by observation?

I have an lmer model object, constructed from some repeated-measures nutrient intake data (two 24-hour intake periods per RespondentID):
Male.lme2 <- lmer(BoxCoxXY ~ -1 + AgeFactor + IntakeDay + (1|RespondentID),
data = Male.Data,
weights = SampleWeight)
and I can successfully retrieve the random effects by RespondentID using ranef(Male.lme1). I would also like to collect the result of the fixed effects by RespondentID. coef(Male.lme1) does not provide exactly what I need, as I show below.
> summary(Male.lme1)
Linear mixed model fit by REML
Formula: BoxCoxXY ~ AgeFactor + IntakeDay + (1 | RespondentID)
Data: Male.Data
AIC BIC logLik deviance REMLdev
9994 10039 -4990 9952 9980
Random effects:
Groups Name Variance Std.Dev.
RespondentID (Intercept) 0.19408 0.44055
Residual 0.37491 0.61230
Number of obs: 4498, groups: RespondentID, 2249
Fixed effects:
Estimate Std. Error t value
(Intercept) 13.98016 0.03405 410.6
AgeFactor4to8 0.50572 0.04084 12.4
AgeFactor9to13 0.94329 0.04159 22.7
AgeFactor14to18 1.30654 0.04312 30.3
IntakeDayDay2Intake -0.13871 0.01809 -7.7
Correlation of Fixed Effects:
(Intr) AgFc48 AgF913 AF1418
AgeFactr4t8 -0.775
AgeFctr9t13 -0.761 0.634
AgFctr14t18 -0.734 0.612 0.601
IntkDyDy2In -0.266 0.000 0.000 0.000
I have appended the fitted results to my data, head(Male.Data) shows
NutrientID RespondentID Gender Age SampleWeight IntakeDay IntakeAmt AgeFactor BoxCoxXY lmefits
2 267 100020 1 12 0.4952835 Day1Intake 12145.852 9to13 15.61196 15.22633
7 267 100419 1 14 0.3632839 Day1Intake 9591.953 14to18 15.01444 15.31373
8 267 100459 1 11 0.4952835 Day1Intake 7838.713 9to13 14.51458 15.00062
12 267 101138 1 15 1.3258785 Day1Intake 11113.266 14to18 15.38541 15.75337
14 267 101214 1 6 2.1198688 Day1Intake 7150.133 4to8 14.29022 14.32658
18 267 101389 1 5 2.1198688 Day1Intake 5091.528 4to8 13.47928 14.58117
The first couple of lines from coef(Male.lme1) are:
$RespondentID
(Intercept) AgeFactor4to8 AgeFactor9to13 AgeFactor14to18 IntakeDayDay2Intake
100020 14.28304 0.5057221 0.9432941 1.306542 -0.1387098
100419 14.00719 0.5057221 0.9432941 1.306542 -0.1387098
100459 14.05732 0.5057221 0.9432941 1.306542 -0.1387098
101138 14.44682 0.5057221 0.9432941 1.306542 -0.1387098
101214 13.82086 0.5057221 0.9432941 1.306542 -0.1387098
101389 14.07545 0.5057221 0.9432941 1.306542 -0.1387098
To demonstrate how the coef results relate to the fitted estimates in Male.Data (which were grabbed using Male.Data$lmefits <- fitted(Male.lme1)), for the first RespondentID, who has the AgeFactor level 9to13:
the fitted value is 15.22633, which equals, from the coefficients, (Intercept) + AgeFactor9to13 = 14.28304 + 0.9432941
Is there a clever command that will do what I want automatically, which is to extract the fixed-effect estimate for each subject, or am I faced with a series of if statements trying to apply the correct AgeFactor level to each subject to get the correct fixed-effect estimate, after deducting the random-effect contribution from the Intercept?
Update: apologies, I was trying to cut down on the output I was providing and forgot about str(). The output is:
>str(Male.Data)
'data.frame': 4498 obs. of 11 variables:
$ NutrientID : int 267 267 267 267 267 267 267 267 267 267 ...
$ RespondentID: Factor w/ 2249 levels "100020","100419",..: 1 2 3 4 5 6 7 8 9 10 ...
$ Gender : int 1 1 1 1 1 1 1 1 1 1 ...
$ Age : int 12 14 11 15 6 5 10 2 2 9 ...
$ BodyWeight : num 51.6 46.3 46.1 63.2 28.4 18 38.2 14.4 14.6 32.1 ...
$ SampleWeight: num 0.495 0.363 0.495 1.326 2.12 ...
$ IntakeDay : Factor w/ 2 levels "Day1Intake","Day2Intake": 1 1 1 1 1 1 1 1 1 1 ...
$ IntakeAmt : num 12146 9592 7839 11113 7150 ...
$ AgeFactor : Factor w/ 4 levels "1to3","4to8",..: 3 4 3 4 2 2 3 1 1 3 ...
$ BoxCoxXY : num 15.6 15 14.5 15.4 14.3 ...
$ lmefits : num 15.2 15.3 15 15.8 14.3 ...
The BodyWeight and Gender aren't being used (this is the males data, so all the Gender values are the same) and the NutrientID is similarly fixed for the data.
I have been doing horrible ifelse statements since I posted, so I will try out your suggestion immediately. :)
Update2: this works perfectly with my current data and should be future-proof for new data, thanks to DWin for the extra help in the comment for this. :)
AgeLevels <- length(unique(Male.Data$AgeFactor))
Temp <- as.data.frame(fixef(Male.lme1)['(Intercept)'] +
c(0,fixef(Male.lme1)[2:AgeLevels])[
match(Male.Data$AgeFactor, c("1to3", "4to8", "9to13","14to18", "19to30","31to50","51to70","71Plus") )] +
c(0,fixef(Male.lme1)[(AgeLevels+1)])[
match(Male.Data$IntakeDay, c("Day1Intake","Day2Intake") )])
names(Temp) <- c("FxdEffct")
Below is how I've always found it easiest to extract the individuals' fixed effects and random effects components in the lme4-package. It actually extracts the corresponding fit to each observation. Assuming we have a mixed-effects model of form:
y = Xb + Zu + e
where Xb are the fixed effects and Zu are the random effects, we can extract the components (using lme4's sleepstudy as an example):
library(lme4)
fm1 <- lmer(Reaction ~ Days + (Days|Subject), sleepstudy)
# Xb
fix <- getME(fm1,'X') %*% fixef(fm1)
# Zu
ran <- t(as.matrix(getME(fm1,'Zt'))) %*% unlist(ranef(fm1))
# Xb + Zu
fixran <- fix + ran
I know that this works as a generalized approach to extracting components from linear mixed-effects models. For non-linear models, the model matrix X contains repeats and you may have to tailor the above code a bit. Here's some validation output as well as a visualization using lattice:
> head(cbind(fix, ran, fixran, fitted(fm1)))
[,1] [,2] [,3] [,4]
[1,] 251.4051 2.257187 253.6623 253.6623
[2,] 261.8724 11.456439 273.3288 273.3288
[3,] 272.3397 20.655691 292.9954 292.9954
[4,] 282.8070 29.854944 312.6619 312.6619
[5,] 293.2742 39.054196 332.3284 332.3284
[6,] 303.7415 48.253449 351.9950 351.9950
# Xb + Zu
> all(round((fixran),6) == round(fitted(fm1),6))
[1] TRUE
# e = y - (Xb + Zu)
> all(round(resid(fm1),6) == round(sleepstudy[,"Reaction"]-(fixran),6))
[1] TRUE
nobs <- 10 # 10 observations per subject
legend = list(text=list(c("y", "Xb + Zu", "Xb")), lines = list(col=c("blue", "red", "black"), pch=c(1,1,1), lwd=c(1,1,1), type=c("b","b","b")))
require(lattice)
xyplot(
Reaction ~ Days | Subject, data = sleepstudy,
panel = function(x, y, ...){
panel.points(x, y, type='b', col='blue')
panel.points(x, fix[(1+nobs*(panel.number()-1)):(nobs*(panel.number()))], type='b', col='black')
panel.points(x, fixran[(1+nobs*(panel.number()-1)):(nobs*(panel.number()))], type='b', col='red')
},
key = legend
)
It is going to be something like this (although you really should have given us the results of str(Male.Data), because the model output does not tell us the factor levels for the baseline values):
#First look at the coefficients
fixef(Male.lme2)
#Then do the calculations
fixef(Male.lme2)["(Intercept)"] +
c(0,fixef(Male.lme2)[2:4])[
match(Male.Data$AgeFactor, c("1to3", "4to8", "9to13","14to18") )] +
c(0,fixef(Male.lme2)[5])[
match(Male.Data$IntakeDay, c("Day1Intake","Day2Intake") )]
You are basically running the original data through a match function to pick the correct coefficient(s) to add to the intercept ... which will be 0 if the data is the factor's base level (whose spelling I am guessing at.)
EDIT: I just noticed that you put a "-1" in the formula, so perhaps all of your AgeFactor terms are listed in the output and you can take out the 0 in the coefficient vector and the invented AgeFactor level in the match table vector.
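A sketch of that adjustment (assuming the coefficient names follow the usual AgeFactor<level> pattern shown in the summary):
## with "-1" every AgeFactor level gets its own coefficient, so index them by name
## instead of adding a 0 entry for a baseline level
fixef(Male.lme2)[paste0("AgeFactor", as.character(Male.Data$AgeFactor))] +
  c(0, fixef(Male.lme2)["IntakeDayDay2Intake"])[
    match(Male.Data$IntakeDay, c("Day1Intake", "Day2Intake"))]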
