How to compute error rate from a decision tree? - r

Does anyone know how to calculate the error rate for a decision tree with R?
I am using the rpart() function.

Assuming you mean computing error rate on the sample used to fit the model, you can use printcp(). For example, using the on-line example,
> library(rpart)
> fit <- rpart(Kyphosis ~ Age + Number + Start, data=kyphosis)
> printcp(fit)
Classification tree:
rpart(formula = Kyphosis ~ Age + Number + Start, data = kyphosis)
Variables actually used in tree construction:
[1] Age Start
Root node error: 17/81 = 0.20988
n= 81
CP nsplit rel error xerror xstd
1 0.176471 0 1.00000 1.00000 0.21559
2 0.019608 1 0.82353 0.82353 0.20018
3 0.010000 4 0.76471 0.82353 0.20018
The Root node error is used to compute two measures of predictive performance, when considering values displayed in the rel error and xerror column, and depending on the complexity parameter (first column):
0.76471 x 0.20988 = 0.1604973 (16.0%) is the resubstitution error rate (i.e., error rate computed on the training sample) -- this is roughly
class.pred <- table(predict(fit, type="class"), kyphosis$Kyphosis)
1-sum(diag(class.pred))/sum(class.pred)
0.82353 x 0.20988 = 0.1728425 (17.2%) is the cross-validated error rate (using 10-fold CV, see xval in rpart.control(); but see also xpred.rpart() and plotcp() which relies on this kind of measure). This measure is a more objective indicator of predictive accuracy.
Note that it is more or less in agreement with classification accuracy from tree:
> library(tree)
> summary(tree(Kyphosis ~ Age + Number + Start, data=kyphosis))
Classification tree:
tree(formula = Kyphosis ~ Age + Number + Start, data = kyphosis)
Number of terminal nodes: 10
Residual mean deviance: 0.5809 = 41.24 / 71
Misclassification error rate: 0.1235 = 10 / 81
where Misclassification error rate is computed from the training sample.

Related

Two-level modelling with lme in R

I am interested in estimating a mixed effect model with two random components (I am sorry for the somewhat unprecise notation. I am somewhat new to these kind of models). Finally, I also want also the standard errors of the variances of the random components. That is why I am somewhat boudn to using the package lme. The reason is that I found this description on how to calculate those standard errors and also interesting, the standard error for function of these variances link.
I believe I know how to use the package lmer. I am finally interested in model2. For the model1, both command yield the same estimates. But model2 with lme yields different results than model2 with lmer from the lme4 package. Could you help me to get around how to set up the random components for lme? This would be much appreciated. Thanks. Please find attached my MWE.
Best
Daniel
#### load all packages #####
loadpackage <- function(x){
for( i in x ){
# require returns TRUE invisibly if it was able to load package
if( ! require( i , character.only = TRUE ) ){
# If package was not able to be loaded then re-install
install.packages( i , dependencies = TRUE )
}
# Load package (after installing)
library( i , character.only = TRUE )
}
}
# Then try/install packages...
loadpackage( c("nlme", "msm", "lmeInfo", "lme4"))
alcohol1 <- read.table("https://stats.idre.ucla.edu/stat/r/examples/alda/data/alcohol1_pp.txt", header=T, sep=",")
attach(alcohol1)
id <- as.factor(id)
age <- as.factor(age)
model1.lmer <-lmer(alcuse ~ 1 + peer + (1|id))
summary(model1.lmer)
model2.lmer <-lmer(alcuse ~ 1 + peer + (1|id) + (1|age))
summary(model2.lmer)
model1.lme <- lme(alcuse ~ 1+ peer, data = alcohol1, random = ~ 1 |id, method ="REML")
summary(model1.lme)
model2.lme <- lme(alcuse ~ 1+ peer, data = alcohol1, random = ~ 1 |id + 1|age, method ="REML")
Edit (15/09/2021):
Estimating the model as follows end then returning the estimates via nlme::VarCorr gives me different results. While the estimates seem to be in the ball park, it is as they are switched across components.
model2a.lme <- lme(alcuse ~ 1+ peer, data = alcohol1, random = ~ 1 |id/age, method ="REML")
summary(model2a.lme)
nlme::VarCorr(model2a.lme)
Variance StdDev
id = pdLogChol(1)
(Intercept) 0.38390274 0.6195989
age = pdLogChol(1)
(Intercept) 0.47892113 0.6920413
Residual 0.08282585 0.2877948
EDIT (16/09/2021):
Since Bob pushed me to think more about my model, I want to give some additional information. Please know that the data I use in the MWE do not match my true data. I just used it for illustrative purposes since I can not upload myy true data. I have a household panel with income, demographic informations and parent indicators.
I am interested in intergenerational mobility. Sibling correlations of permanent income are one industry standard. At the very least, contemporanous observations are very bad proxies of permanent income. Due to transitory shocks, i.e., classical measurement error, those estimates are most certainly attenuated. For this reason, we exploit the longitudinal dimension of our data.
For sibling correlations, this amounts to hypothesising that the income process is as follows:
$$Y_{ijt} = \beta X_{ijt} + \epsilon_{ijt}.$$
With Y being income from individual i from family j in year t. X comprises age and survey year indicators to account for life-cycle effects and macroeconmic conditions in survey years. Epsilon is a compund term comprising a random individual and family component as well as a transitory component (measurement error and short lived shocks). It looks as follows:
$$\epsilon_{ijt} = \alpha_i + \gamma_j + \eta_{ijt}.$$
The variance of income is then:
$$\sigma^2_\epsilon = \sigma^2_\alpha + \sigma^2\gamma + \sigma^2\eta.$$
The quantitiy we are interested in is
$$\rho = \frac(\sigma^2\gamma}{\sigma^2_\alpha + \sigma^2\gamma},$$
which reflects the share of shared family (and other characteristics) among siblings of the variation in permanent income.
B.t.w.: The struggle is simply because I want to have a standard errors for all estimates and for \rho.
This is an example of crossed vs nested random effects. (Note that the example you refer to is fitting a different kind of model, a random-slopes model rather than a model with two different grouping variables ...)
If you try with(alcohol1, table(age,id)) you can see that every id is associated with every possible age (14, 15, 16). Or subset(alcohol1, id==1) for example:
id age coa male age_14 alcuse peer cpeer ccoa
1 1 14 1 0 0 1.732051 1.264911 0.2469111 0.549
2 1 15 1 0 1 2.000000 1.264911 0.2469111 0.549
3 1 16 1 0 2 2.000000 1.264911 0.2469111 0.549
There are three possible models you could fit for a model with random effects of age(indexed by i) and id (indexed by j)
crossed ((1|age) + (1|id)): Y_{ij} = beta0 + beta1*peer + eps1_i + eps2_j +epsr_{ij}; alcohol use varies among individuals and, independently, across ages (this model won't work very well because there are only three distinct ages in the data set, more levels are usually needed)
id nested within age ((1|age/id) = (1|age) + (1|age:id)): Y_{ij} = beta0 + beta1*peer + eps1_i + eps2_{ij} + epsr_{ij}; alcohol use varies across ages, and varies across individuals within ages (see note above about number of levels).
age nested within id ((1|id/age) = (1|id) + (1|age:id)): Y_{ij} = beta0 + beta1*peer + eps1_j + eps2_{ij} + epsr_{ij}; alcohol use varies across individuals, and varies across ages within individuals
Here eps1_i, eps2_{ij}, and epsr_{ij} are normal deviates; epsr is the residual error term.
The latter two models actually don't make sense in this case; because there is only one observation per age/id combination, the nested variance (eps2) is completely confounded with the residual variance (epsr). lme doesn't notice this; if you try to fit one of the nested models in lmer it will give an error that
number of levels of each grouping factor must be < number of observations (problems: id:age)
(Although if you try to compute confidence intervals based on model1.lme you'll get an error "cannot get confidence intervals on var-cov components: Non-positive definite approximate variance-covariance", which is a hint that something is wrong.)
You could restate this problem as saying that the residual variation, and the variation among ages within individuals, are jointly unidentifiable (can't be separated from each other, statistically).
The updated answer here shows how to get the standard errors of the variance components from an lmer model, so you shouldn't be stuck with lme (but you should think carefully about which model you're really trying to fit ...)
The GLMM FAQ might also be useful.
More generally, the standard error of
rho = (V_gamma)/(V_alpha + V_gamma)
will be hard to compute accurately, because this is a nonlinear function of the model parameters. You can apply the delta method, but the most reliable approach would be to use parametric bootstrapping: if you have a fitted model m, then something like this should work:
var_ratio <- function(m) {
v <- as.data.frame(sapply(VarCorr(m), as.numeric))
return(v$family/(v$family + v$id))
}
confint(m, method="boot", FUN =var_ratio)
You should specify random effects in lme by using / not +
By lmer
model2.lmer <-lmer(alcuse ~ 1 + peer + (1|id) + (1|age), data = alcohol1)
summary(model2.lmer)
Linear mixed model fit by REML ['lmerMod']
Formula: alcuse ~ 1 + peer + (1 | id) + (1 | age)
Data: alcohol1
REML criterion at convergence: 651.3
Scaled residuals:
Min 1Q Median 3Q Max
-2.0228 -0.5310 -0.1329 0.5854 3.1545
Random effects:
Groups Name Variance Std.Dev.
id (Intercept) 0.08078 0.2842
age (Intercept) 0.30313 0.5506
Residual 0.56175 0.7495
Number of obs: 246, groups: id, 82; age, 82
Fixed effects:
Estimate Std. Error t value
(Intercept) 0.3039 0.1438 2.113
peer 0.6074 0.1151 5.276
Correlation of Fixed Effects:
(Intr)
peer -0.814
By lme
model2.lme <- lme(alcuse ~ 1+ peer, data = alcohol1, random = ~ 1 |id/age, method ="REML")
summary(model2.lme)
Linear mixed-effects model fit by REML
Data: alcohol1
AIC BIC logLik
661.3109 678.7967 -325.6554
Random effects:
Formula: ~1 | id
(Intercept)
StdDev: 0.4381206
Formula: ~1 | age %in% id
(Intercept) Residual
StdDev: 0.4381203 0.7494988
Fixed effects: alcuse ~ 1 + peer
Value Std.Error DF t-value p-value
(Intercept) 0.3038946 0.1438333 164 2.112825 0.0361
peer 0.6073948 0.1151228 80 5.276060 0.0000
Correlation:
(Intr)
peer -0.814
Standardized Within-Group Residuals:
Min Q1 Med Q3 Max
-2.0227793 -0.5309669 -0.1329302 0.5853768 3.1544873
Number of Observations: 246
Number of Groups:
id age %in% id
82 82
Okay, finally. Just to sketch my confidential data: I have a panel of individuals. The data includes siblings, identified via mnr. income is earnings, wavey survey year, age age factors. female a factor for gender, pid is the factor identifying the individual.
m1 <- lmer(income ~ age + wavey + female + (1|pid) + (1 | mnr),
data = panel)
vv <- vcov(m1, full = TRUE)
covvar <- vv[58:60, 58:60]
covvar
3 x 3 Matrix of class "dgeMatrix"
cov_pid.(Intercept) cov_mnr.(Intercept) residual
[1,] 2.6528679 -1.4624588 -0.4077576
[2,] -1.4624588 3.1015001 -0.0597926
[3,] -0.4077576 -0.0597926 1.1634680
mean <- as.data.frame(VarCorr(m1))$vcov
mean
[1] 17.92341 16.86084 56.77185
deltamethod(~ x2/(x1+x2), mean, covvar, ses =TRUE)
[1] 0.04242089
The last scalar should be what I interprete as the shared background of the siblings of permanent income.
Thanks to #Ben Bolker who pointed me into this direction.

incorrect logistic regression output

I'm doing logistic regression on Boston data with a column high.medv (yes/no) which indicates if the median house pricing given by column medv is either more than 25 or not.
Below is my code for logistic regression.
high.medv <- ifelse(Boston$medv>25, "Y", "N") # Applying the desired
`condition to medv and storing the results into a new variable called "medv.high"
ourBoston <- data.frame (Boston, high.medv)
ourBoston$high.medv <- as.factor(ourBoston$high.medv)
attach(Boston)
# 70% of data <- Train
train2<- subset(ourBoston,sample==TRUE)
# 30% will be Test
test2<- subset(ourBoston, sample==FALSE)
glm.fit <- glm (high.medv ~ lstat,data = train2, family = binomial)
summary(glm.fit)
The output is as follows:
Deviance Residuals:
[1] 0
Coefficients: (1 not defined because of singularities)
Estimate Std. Error z value Pr(>|z|)
(Intercept) -22.57 48196.14 0 1
lstat NA NA NA NA
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 0.0000e+00 on 0 degrees of freedom
Residual deviance: 3.1675e-10 on 0 degrees of freedom
AIC: 2
Number of Fisher Scoring iterations: 21
Also i need:
Now I'm required to use the misclassification rate as the measure of error for the two cases:
using lstat as the predictor, and
using all predictors except high.medv and medv.
but i am stuck at the regression itself
With every classification algorithm, the art relies on choosing the threshold upon which you will determine whether the the result is positive or negative.
When you predict your outcomes in the test data set you estimate probabilities of the response variable being either 1 or 0. Therefore, you need to the tell where you are gonna cut, the threshold, at which the prediction becomes 1 or 0.
A high threshold is more conservative about labeling a case as positive, which makes it less likely to produce false positives and more likely to produce false negatives. The opposite happens for low thresholds.
The usual procedure is to plot the rates that interests you, e.g., true positives and false positives against each other, and then choose what is the best rate for you.
set.seed(666)
# simulation of logistic data
x1 = rnorm(1000) # some continuous variables
z = 1 + 2*x1 # linear combination with a bias
pr = 1/(1 + exp(-z)) # pass through an inv-logit function
y = rbinom(1000, 1, pr)
df = data.frame(y = y, x1 = x1)
df$train = 0
df$train[sample(1:(2*nrow(df)/3))] = 1
df$new_y = NA
# modelling the response variable
mod = glm(y ~ x1, data = df[df$train == 1,], family = "binomial")
df$new_y[df$train == 0] = predict(mod, newdata = df[df$train == 0,], type = 'response') # predicted probabilities
dat = df[df$train==0,] # test data
To use missclassification error to evaluate your model, first you need to set up a threshold. For that, you can use the roc function from pROC package, which calculates the rates and provides the corresponding thresholds:
library(pROC)
rates =roc(dat$y, dat$new_y)
plot(rates) # visualize the trade-off
rates$specificity # shows the ratio of true negative over overall negatives
rates$thresholds # shows you the corresponding thresholds
dat$jj = as.numeric(dat$new_y>0.7) # using 0.7 as a threshold to indicate that we predict y = 1
table(dat$y, dat$jj) # provides the miss classifications given 0.7 threshold
0 1
0 86 20
1 64 164
The accuracy of your model can be computed as the ratio of the number of observations you got right against the size of your sample.

Tree sizes given by CP table in rpart

In the R package rpart, what determines the size of trees presented within the CP table for a decision tree? In the below example, the CP table defaults to presenting only trees with 1, 2, and 5 nodes (as nsplit = 0, 1 and 4 respectively).
library(rpart)
fit <- rpart(Kyphosis ~ Age + Number + Start, method="class", data=kyphosis)
> printcp(fit)
Classification tree:
rpart(formula = Kyphosis ~ Age + Number + Start, data = kyphosis,
method = "class")
Variables actually used in tree construction:
[1] Age Start
Root node error: 17/81 = 0.20988
n= 81
CP nsplit rel error xerror xstd
1 0.176471 0 1.00000 1.00000 0.21559
2 0.019608 1 0.82353 0.94118 0.21078
3 0.010000 4 0.76471 0.94118 0.21078
Is there an inherent rule rpart() used to determine what size of trees to present? And is it possible to force printcp() to return cross-validation statistics for all possible sizes of tree, i.e. for the above example, also include rows for trees with 3 and 4 nodes (nsplit = 2, 3)?
The rpart() function is controlled using the rpart.control() function. It has parameters such as minsplit which tells the function to only split when there are more observations then the value specified and cp which tells the function to only split if the overall lack of fit is decreased by a factor of cp.
If you look at summary(fit) on your above example it shows the statistics for all values of nsplit. To get these values to print when using printcp(fit) you need to choose appropriate values of cp and minsplit when calling the original rpart function.
The cran-r documentation on rpart mentions adding option cp=0 to the rpart function. http://cran.r-project.org/web/packages/rpart/vignettes/longintro.pdf
It also mentions other options which can be given in the rpart function for eg to control the number of splits.
dfit <- rpart(y ~ x, method='class',
control = rpart.control(xval = 10, minbucket = 2, **cp = 0**))

Error in nlme repeated measures

I'm trying to run a linear mixed model with repeated measures at 57 different timepoints. But I keep getting the error message:
Error in solve.default(estimates[dimE[1L] - (p:1), dimE[2L] - (p:1), drop = FALSE]) :
system is computationally singular: reciprocal condition number = 7.7782e-18
What does this mean?
My code is such:
model.dataset = data.frame(TimepointM=timepoint,SubjectM=sample,GeneM=gene)
library("nlme")
model = lme(score ~ TimepointM + GeneM,data=model.dataset,random = ~1|SubjectM)
Here's the data:
score = c(2,-3,11,14,1,7,2,-3,11,14,1,7,7,2,-3,11,14,1,7,2,-3,11,14,1,7,2,-3,11,14,1,7,2,-3,11,14,1,7,2,-3,11,14,1,7,7,2,-3,11,14,1,7,2,-3,11,14,1,7,7,2,-3,11,14,1,7,7,2,-3,11,14,1,7,6,7,2,-3,11,14,1,7,6,7,2,-3,11,14,1,7,6,2,-3,11,14,1,7,7,2,-3,11,14,1,7,6,2,-3,11,14,1,7,2,-3,11,14,1,7,7,2,-3,11,14,1,7,2,-3,11,14,1,7,2,-3,11,14,1,7,7,2,-3,11,14,1,7,7,2,-3,11,14,1,7,2,-3,11,14,1,7,2,-3,11,14,1,7,2,-3,11,14,1,7,7,2,-3,11,14,1,7,2,-3,11,14,1,7,6,2,-3,11,14,1,7,7,2,-3,11,14,1,7,6,7,2,-3,11,14,1,7,2,-3,11,14,1,7,7,2,-3,11,14,1,7,2,-3,11,14,1,7,2,-3,11,14,1,7,2,-3,11,14,1,7,6,7,2,-3,11,14,1,7,7,2,-3,11,14,1,7,6,7,2,-3,11,14,1,7,6,2,-3,11,14,1,7,7,2,-3,11,14,1,7,2,-3,11,14,1,7,2,-3,11,14,1,7,2,-3,11,14,1,7,2,-3,11,14,1,7,7,2,-3,11,14,1,7,2,-3,11,14,1,7,2,-3,11,14,1,7,7,2,-3,11,14,1,7,7,2,-3,11,14,1,7,2,-3,11,14,1,7,2,-3,11,14,1,7,2,-3,11,14,1,7,2,-3,11,14,1,7)
timepoint = c(1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,3,3,4,4,4,4,4,4,5,5,5,5,5,5,6,6,6,6,6,6,7,7,7,7,7,7,8,8,8,8,8,8,8,9,9,9,9,9,9,10,10,10,10,10,10,10,12,12,12,12,12,12,12,12,13,13,13,13,13,13,13,13,14,14,14,14,14,14,14,14,15,15,15,15,15,15,16,16,16,16,16,16,16,16,17,17,17,17,17,17,18,18,18,18,18,18,19,19,19,19,19,19,19,20,20,20,20,20,20,21,21,21,21,21,21,24,24,24,24,24,24,24,25,25,25,25,25,25,25,27,27,27,27,27,27,28,28,28,28,28,28,29,29,29,29,29,29,30,30,30,30,30,30,30,31,31,31,31,31,31,31,32,32,32,32,32,32,33,33,33,33,33,33,33,33,34,34,34,34,34,34,34,35,35,35,35,35,35,36,36,36,36,36,36,36,37,37,37,37,37,37,38,38,38,38,38,38,39,39,39,39,39,39,39,40,40,40,40,40,40,40,41,41,41,41,41,41,41,41,42,42,42,42,42,42,42,42,43,43,43,43,43,43,44,44,44,44,44,44,44,45,45,45,45,45,45,46,46,46,46,46,46,47,47,47,47,47,47,48,48,48,48,48,48,49,49,49,49,49,49,49,50,50,50,50,50,50,51,51,51,51,51,51,52,52,52,52,52,52,52,53,53,53,53,53,53,53,54,54,54,54,54,54,55,55,55,55,55,55,56,56,56,56,56,56,57,57,57,57,57,57)
sample = c("S02T0","S03T0","S07T0","S09T0","S10T0","S12T0","S02T0","S03T0","S07T0","S09T0","S10T0","S12T0","S01T0","S02T0","S03T0","S07T0","S09T0","S10T0","S12T0","S02T0","S03T0","S07T0","S09T0","S10T0","S12T0","S02T0","S03T0","S07T0","S09T0","S10T0","S12T0","S02T0","S03T0","S07T0","S09T0","S10T0","S12T0","S02T0","S03T0","S07T0","S09T0","S10T0","S12T0","S01T0","S02T0","S03T0","S07T0","S09T0","S10T0","S12T0","S02T0","S03T0","S07T0","S09T0","S10T0","S12T0","S01T0","S02T0","S03T0","S07T0","S09T0","S10T0","S12T0","S01T0","S02T0","S03T0","S07T0","S09T0","S10T0","S12T0","S13T0","S01T0","S02T0","S03T0","S07T0","S09T0","S10T0","S12T0","S13T0","S01T0","S02T0","S03T0","S07T0","S09T0","S10T0","S12T0","S13T0","S02T0","S03T0","S07T0","S09T0","S10T0","S12T0","S01T0","S02T0","S03T0","S07T0","S09T0","S10T0","S12T0","S13T0","S02T0","S03T0","S07T0","S09T0","S10T0","S12T0","S02T0","S03T0","S07T0","S09T0","S10T0","S12T0","S01T0","S02T0","S03T0","S07T0","S09T0","S10T0","S12T0","S02T0","S03T0","S07T0","S09T0","S10T0","S12T0","S02T0","S03T0","S07T0","S09T0","S10T0","S12T0","S01T0","S02T0","S03T0","S07T0","S09T0","S10T0","S12T0","S01T0","S02T0","S03T0","S07T0","S09T0","S10T0","S12T0","S02T0","S03T0","S07T0","S09T0","S10T0","S12T0","S02T0","S03T0","S07T0","S09T0","S10T0","S12T0","S02T0","S03T0","S07T0","S09T0","S10T0","S12T0","S01T0","S02T0","S03T0","S07T0","S09T0","S10T0","S12T0","S02T0","S03T0","S07T0","S09T0","S10T0","S12T0","S13T0","S02T0","S03T0","S07T0","S09T0","S10T0","S12T0","S01T0","S02T0","S03T0","S07T0","S09T0","S10T0","S12T0","S13T0","S01T0","S02T0","S03T0","S07T0","S09T0","S10T0","S12T0","S02T0","S03T0","S07T0","S09T0","S10T0","S12T0","S01T0","S02T0","S03T0","S07T0","S09T0","S10T0","S12T0","S02T0","S03T0","S07T0","S09T0","S10T0","S12T0","S02T0","S03T0","S07T0","S09T0","S10T0","S12T0","S02T0","S03T0","S07T0","S09T0","S10T0","S12T0","S13T0","S01T0","S02T0","S03T0","S07T0","S09T0","S10T0","S12T0","S01T0","S02T0","S03T0","S07T0","S09T0","S10T0","S12T0","S13T0","S01T0","S02T0","S03T0","S07T0","S09T0","S10T0","S12T0","S13T0","S02T0","S03T0","S07T0","S09T0","S10T0","S12T0","S01T0","S02T0","S03T0","S07T0","S09T0","S10T0","S12T0","S02T0","S03T0","S07T0","S09T0","S10T0","S12T0","S02T0","S03T0","S07T0","S09T0","S10T0","S12T0","S02T0","S03T0","S07T0","S09T0","S10T0","S12T0","S02T0","S03T0","S07T0","S09T0","S10T0","S12T0","S01T0","S02T0","S03T0","S07T0","S09T0","S10T0","S12T0","S02T0","S03T0","S07T0","S09T0","S10T0","S12T0","S02T0","S03T0","S07T0","S09T0","S10T0","S12T0","S01T0","S02T0","S03T0","S07T0","S09T0","S10T0","S12T0","S01T0","S02T0","S03T0","S07T0","S09T0","S10T0","S12T0","S02T0","S03T0","S07T0","S09T0","S10T0","S12T0","S02T0","S03T0","S07T0","S09T0","S10T0","S12T0","S02T0","S03T0","S07T0","S09T0","S10T0","S12T0","S02T0","S03T0","S07T0","S09T0","S10T0","S12T0")
gene =c(24.1215870,-18.8771658,-27.3747309,-41.5740199,26.1561877,-2.7836332,20.8322796,36.5745088,-24.1541743,-11.2362216,4.9042852,7.4230219,155.8663563,16.4465366,-11.7982286,-1.6102783,-35.9559091,27.7909495,-13.9181661,-29.6037658,-68.4297261,-45.0877920,-48.3157529,17.1649982,-26.9084544,19.7358439,-5.8991143,-24.1541743,-23.5960654,13.0780939,-2.7836332,18.6394081,-28.3157487,-49.9186269,-33.7086648,41.6864242,-30.6199654,36.1823804,-36.5745088,-49.9186269,-44.9448864,-4.9042852,-34.3314764,62.3465425,-42.7609951,-11.7982286,-32.2055657,-56.1811080,5.7216661,-17.6296771,4.3857431,-43.6534459,9.6616697,-44.9448864,18.7997599,-12.9902884,109.1064494,7.6750504,-43.6534459,-17.7130611,-25.8433097,5.7216661,-18.5575548,35.2750175,36.1823804,2.3596457,-25.7644526,-55.0574858,15.5302365,-19.4854325,73.3687689,63.1668918,20.8322796,16.5175201,-22.5438960,-28.0905540,15.5302365,7.4230219,39.5062602,107.4657509,36.1823804,-23.5964573,-45.0877920,-43.8212642,4.0869043,-40.8266205,26.3375068,13.1572292,-25.9561030,-40.2569571,-52.8102415,2.4521426,-49.1775202,246.1047731,36.1823804,11.7982286,-35.4261223,-26.9669318,-2.4521426,-38.0429873,38.5656349,9.8679219,16.5175201,8.0513914,-42.6976421,26.9735686,-26.9084544,4.3857431,12.9780515,-32.2055657,-33.7086648,9.8085704,-36.2800196,215.7518511,6.5786146,-9.4385829,-19.3233394,-40.4503978,17.1649982,-7.4230219,14.2536650,-23.5964573,-53.1391834,-52.8102415,22.0692834,-54.7447866,24.1215870,-44.8332688,-24.1541743,-42.6976421,26.9735686,-40.8266205,191.1413737,17.5429723,-70.7893718,-37.0364006,-39.3267756,-4.9042852,-0.9278777,93.5198138,-6.5786146,-24.7762801,-28.9850091,-39.3267756,22.0692834,-50.1053979,14.2536650,23.5964573,-20.9336177,-53.9338637,14.7128556,-39.8987428,4.3857431,-64.8902575,-59.5802966,-33.7086648,22.0692834,2.7836332,46.0503024,-35.3946859,-43.4775137,-53.9338637,30.2430921,-34.3314764,80.3942259,28.5073300,-87.3068919,-24.1541743,-62.9228410,13.0780939,-25.0526990,35.0859447,-24.7762801,-38.6466789,-58.4283523,31.0604729,0.0000000,24.4562563,1.0964358,-27.1359259,-75.6830794,-16.8543324,20.4345217,-11.1345329,74.1390629,18.2282447,-27.3044720,-45.2890768,-46.7707724,15.3258912,-27.9523169,-6.9763039,117.3099418,18.6394081,-21.2368115,-38.6466789,-34.8322870,22.0692834,-48.2496425,6.5786146,-64.8902575,-51.5289052,-80.9007955,23.7040451,-26.9084544,223.1349942,8.7714862,10.6184058,-127.2119846,-31.4614205,0.8173809,-16.7017993,9.8679219,-35.3946859,-54.7494617,-44.9448864,14.7128556,-18.5575548,97.5827836,-166.3550237,-95.0064189,-123.5984376,104.6247509,-121.5519839,33.9895089,-44.8332688,-40.2569571,-56.1811080,51.4949946,0.0000000,-16.9312544,95.9808615,6.5786146,-21.2368115,-9.6616697,-13.4834659,10.6259513,-25.9805767,116.4895926,-1.0964358,-16.5175201,-56.3597400,-44.9448864,13.8954747,-12.9902884,-5.6437515,71.3703842,25.2180227,-41.2938002,-53.1391834,-32.5850426,8.9911895,12.9902884,31.9812582,1.0964358,-70.7893718,-33.8158440,-38.2031534,-15.5302365,-25.0526990,153.4053085,36.1823804,-34.2148630,-41.8672354,-19.1015767,22.8866643,0.9278777,20.8322796,-29.4955716,-43.4775137,-69.6645739,33.5126155,-45.4660092,26.3144585,-33.0350402,24.1541743,-42.6976421,0.0000000,-28.7642099,38.3752520,-7.0789372,-22.5438960,-20.2251989,34.3299964,19.4854325,4.3857431,-61.3507889,-33.8158440,-64.0464631,39.2342816,-28.7642099,183.7582306,-4.3857431,-22.4166344,-28.9850091,-57.3047302,25.3388069,-26.9084544,35.0859447,7.0789372,-33.8158440,-43.8212642,-1.6347617,5.5672664,-35.0859447,-40.1139773,-14.4925046,-12.3598438,21.2519025,-14.8460438,119.7709896,30.7002016,-22.4166344,-46.6980703,-43.8212642,5.7216661,-10.2066551,203.4466124,116.2221917,-83.7674233,-109.4989234,-38.2031534,78.4685632,-56.6005421,21.9287154,-63.7104346,-56.3597400,-4.4944886,25.3388069,-73.3023414,29.6037658,-31.8552173,-46.6980703,-79.7771734,21.2519025,-18.5575548,16.4465366,-27.1359259,-43.4775137,-41.5740199,-11.4433321,-23.1969435,27.4108943,-84.9472461,-53.1391834,-40.4503978,22.8866643,16.7017993)
tl;dr I think your problem is that every individual has exactly the same response value (score) for every time point (i.e. perfect homogeneity within individuals), so the random-effects term completely explains the data; there's nothing left over for the fixed effects. Are you sure you didn't want to use gene as your response variable?? (Discovered after running through a bunch of modeling attempts, by plotting the damn data, something everyone should always do first ...)
## simplifying names etc. slightly
dd <- data.frame(timepoint,sample,gene,score,)
library("nlme")
m0 <- lme(score ~ timepoint + gene, data=dd,
random = ~1|sample)
## reproduces error
As a first check, let's just see if there's something in your fixed-effect model that is singular:
lm(score~timepoint+gene,dd)
##
## Call:
## lm(formula = score ~ timepoint + gene, data = dd)
##
## Coefficients:
## (Intercept) timepoint gene
## 5.414652 -0.004064 -0.024485
No, that works fine.
Let's try it in lme4:
library(lme4)
m1 <- lmer(score ~ timepoint + gene + (1|sample), data=dd)
## Error in fn(x, ...) : Downdated VtV is not positive definite
Let's try scaling & centering the data -- sometimes that helps:
ddsc <- transform(dd,
timepoint=scale(timepoint),
gene=scale(gene))
lme still fails:
m0sc <- lme(score ~ timepoint + gene, data=ddsc,
random = ~1|sample)
lmer works -- sort of!
m1sc <- lmer(score ~ timepoint + gene + (1|sample), data=ddsc)
## Warning message:
## In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
## Model is nearly unidentifiable: very large eigenvalue
## - Rescale variables?
The results give coefficients for the parameters that are vanishingly close to zero. (The residual variance is also vanishingly small.)
## m1sc
## Linear mixed model fit by REML ['lmerMod']
## Formula: score ~ timepoint + gene + (1 | sample)
## Data: ddsc
## REML criterion at convergence: -9062.721
## Random effects:
## Groups Name Std.Dev.
## sample (Intercept) 7.838e-01
## Residual 3.344e-07
## Number of obs: 348, groups: sample, 8
## Fixed Effects:
## (Intercept) timepoint gene
## 5.714e+00 -4.194e-16 -1.032e-14
At this point I can only think of a couple of possibilities:
there's something about the experimental design that means the random effects are somehow (?) completely confounded with one or both of the fixed effects
these are simulated data that are artificially constructed to be perfectly balanced ... ?
library(ggplot2); theme_set(theme_bw())
ggplot(dd,aes(timepoint,score,group=sample,colour=gene))+
geom_point(size=4)+
geom_line(colour="red",alpha=0.5)
Aha!
In order for R to solve a matrix, it needs to be computationally invertible. The error you are getting back is telling you that, for computational purposes, your matrix is singular, which means it does not have an inverse.
As this error deals more with the statistical theory side, it's probably better suited for cross-validated. See this link for more information.
Check your data to make sure you do not have perfectly correlated independent variables.

Error with nlme

For IGF data from nlme library, I'm getting this error message:
lme(conc ~ 1, data=IGF, random=~age|Lot)
Error in lme.formula(conc ~ 1, data = IGF, random = ~age | Lot) :
nlminb problem, convergence error code = 1
message = iteration limit reached without convergence (10)
But everything is fine with this code
lme(conc ~ age, data=IGF)
Linear mixed-effects model fit by REML
Data: IGF
Log-restricted-likelihood: -297.1831
Fixed: conc ~ age
(Intercept) age
5.374974367 -0.002535021
Random effects:
Formula: ~age | Lot
Structure: General positive-definite
StdDev Corr
(Intercept) 0.082512196 (Intr)
age 0.008092173 -1
Residual 0.820627711
Number of Observations: 237
Number of Groups: 10
As IGF is groupedData, so both codes are identical. I'm confused why the first code produces error. Thanks for your time and help.
I find the other, older answer here unsatisfactory. I distinguish between cases where, statistically, age has no impact and conversely we encounter a computational error. Personally, I have made career mistakes by conflating these two cases. R has signaled the latter and I would like to dive into why that is.
The model that OP has specified is a growth model, with random slopes and intercepts. A grand intercept is included but not a grand age slope. One unsavory constraint that is imposed by fitting a random slope without addition of its "grand" term is that you are forcing the random slope to have 0 mean, which is very difficult to optimize. Marginal models indicate age does not have a statistically significant different value from 0 in the model. Furthermore adding age as a fixed effect does not remedy the problem.
> lme(conc~ age, random=~age|Lot, data=IGF)
Error in lme.formula(conc ~ age, random = ~age | Lot, data = IGF) :
nlminb problem, convergence error code = 1
message = iteration limit reached without convergence (10)
Here the error is obvious. It might be tempting to set the number of iterations up. lmeControl has many iterative estimands. But even that doesn't work:
> fit <- lme(conc~ 1, random=~age|Lot, data=IGF,
control = lmeControl(maxIter = 1e8, msMaxIter = 1e8))
Error in lme.formula(conc ~ 1, random = ~age | Lot,
data = IGF, control = lmeControl(maxIter = 1e+08, :
nlminb problem, convergence error code = 1
message = singular convergence (7)
So it's not a precision thing, the optimizer is running out-of-bounds.
There must be key differences between the two models you have proposed fitting, and a way to diagnose the error that you have found. One simple approach is specifying a "verbose" fit for the problematic model:
> lme(conc~ 1, random=~age|Lot, data=IGF, control = lmeControl(msVerbose = TRUE))
0: 602.96050: 2.63471 4.78706 141.598
1: 602.85855: 3.09182 4.81754 141.597
2: 602.85312: 3.12199 4.97587 141.598
3: 602.83803: 3.23502 4.93514 141.598
(truncated)
48: 602.76219: 6.22172 4.81029 4211.89
49: 602.76217: 6.26814 4.81000 4425.23
50: 602.76216: 6.31630 4.80997 4638.57
50: 602.76216: 6.31630 4.80997 4638.57
The first term is the REML (I think). The second through fourth terms are the parameters to an object called lmeSt of class lmeStructInt, lmeStruct, and modelStruct. If you use Rstudio's debugger to inspect attributes of this object (the lynchpin of the problem), you'll see it is the random effects component that explodes here. coef(lmeSt) after 50 iterations produces
reStruct.Lot1 reStruct.Lot2 reStruct.Lot3
6.316295 4.809975 4638.570586
as seen above and produces
> coef(lmeSt, unconstrained = FALSE)
reStruct.Lot.var((Intercept)) reStruct.Lot.cov(age,(Intercept))
306382.7 2567534.6
reStruct.Lot.var(age)
21531399.4
which is the same as the
Browse[1]> lmeSt$reStruct$Lot
Positive definite matrix structure of class pdLogChol representing
(Intercept) age
(Intercept) 306382.7 2567535
age 2567534.6 21531399
So it's clear the covariance of the random effects is something that's exploding here for this particular optimizer. PORT routines in nlminb have been criticized for their uninformative errors. The text from David Gay (Bell Labs) is here http://ms.mcmaster.ca/~bolker/misc/port.pdf The PORT documentation suggests our error 7 from using a 1 billion iter max "x may have too many free components. See ยง5.". Rather than fix the algorithm, it behooves us to ask if there are approximate results which should generate similar outcomes. It is, for instance, easy to fit an lmList object to come up with the random intercept and random slope variance:
> fit <- lmList(conc ~ age | Lot, data=IGF)
> cov(coef(fit))
(Intercept) age
(Intercept) 0.13763699 -0.018609973
age -0.01860997 0.003435819
although ideally these would be weighted by their respective precision weights:
To use the nlme package I note that unconstrained optimization using BFGS does not produce such an error and gives similar results:
> lme(conc ~ 1, data=IGF, random=~age|Lot, control = lmeControl(opt = 'optim'))
Linear mixed-effects model fit by REML
Data: IGF
Log-restricted-likelihood: -292.9675
Fixed: conc ~ 1
(Intercept)
5.333577
Random effects:
Formula: ~age | Lot
Structure: General positive-definite, Log-Cholesky parametrization
StdDev Corr
(Intercept) 0.032109976 (Intr)
age 0.005647296 -0.698
Residual 0.820819785
Number of Observations: 237
Number of Groups: 10
An alternative syntactical declaration of such a model can be done with the MUCH easier lme4 package:
library(lme4)
lmer(conc ~ 1 + (age | Lot), data=IGF)
which yields:
> lmer(conc ~ 1 + (age | Lot), data=IGF)
Linear mixed model fit by REML ['lmerMod']
Formula: conc ~ 1 + (age | Lot)
Data: IGF
REML criterion at convergence: 585.7987
Random effects:
Groups Name Std.Dev. Corr
Lot (Intercept) 0.056254
age 0.006687 -1.00
Residual 0.820609
Number of obs: 237, groups: Lot, 10
Fixed Effects:
(Intercept)
5.331
An attribute of lmer and its optimizer is that random effects correlations which are very close to 1, 0, or -1 are simply set to those values since it simplifies the optimization (and statistical efficiency of the estimation) substantially.
Taken together, this does not suggest that age does not have an effect, as was said earlier, and this argument can be supported by the numeric results.
If you plot the data, you can see that there is no effect of age, so it seems strange to be trying to fit a random effect of age in spite of this. No wonder it is not converging.
library(nlme)
library(ggplot2)
dev.new(width=6, height=3)
qplot(age, conc, data=IGF) + facet_wrap(~Lot, nrow=2) + geom_smooth(method='lm')
I think what you want to do is model a random effect of Lot on the intercept. We can try including age as a fixed effect, but we'll see that it is not significant and can be thrown out:
> summary(lme(conc ~ 1 + age, data=IGF, random=~1|Lot))
Linear mixed-effects model fit by REML
Data: IGF
AIC BIC logLik
604.8711 618.7094 -298.4355
Random effects:
Formula: ~1 | Lot
(Intercept) Residual
StdDev: 0.07153912 0.829998
Fixed effects: conc ~ 1 + age
Value Std.Error DF t-value p-value
(Intercept) 5.354435 0.10619982 226 50.41849 0.0000
age -0.000817 0.00396984 226 -0.20587 0.8371
Correlation:
(Intr)
age -0.828
Standardized Within-Group Residuals:
Min Q1 Med Q3 Max
-5.46774548 -0.43073893 -0.01519143 0.30336310 5.28952876
Number of Observations: 237
Number of Groups: 10

Resources