Consistency between OOB error rate and the confusion matrix

When inspecting the statistics of my models, it looks like the numbers in the confusion matrix are not consistent with those of the OOB error rate in randomForest.
How can I deduce the OOB error rate from the confusion matrix? What is the relationship between them?
In the example below, I fit two models: one with stratified sampling (drawing a subset of the samples via sampsize) and one without (i.e. with the default sampling scheme, which I think uses all samples).
I don't have the data public, but here are the function calls:
library(randomForest)
sumY <- summary(Y)                         # class counts of the factor response Y
sampsize <- c(sumY["Y0"] / 10, sumY["Y1"])
# First model: stratified sampling via sampsize/strata
strat.rf.model <- randomForest(x = X, y = Y, sampsize = sampsize, strata = Y)
# Second model: default sampling
rf.model <- randomForest(x = X, y = Y)

It's not inconsistent, it's just arithmetic:
> 180 / (1699 + 180)
[1] 0.09579564
> 63 / (63 + 58)
[1] 0.5206612
> (180 + 63) / (1699 + 180 + 63 + 58)
[1] 0.1215
The error rate in each class is defined as the proportion of misclassified observations in just that class, whereas the overall misclassification rate is the proportion of misclassified observations for the entire data set.
It is rare for the per-class error rates to exactly match the overall error rate. If you stop and think about it for a second, this makes perfect sense: some classes are harder to identify than others, and the overall error rate is a weighted average of the class error rates, weighted by class size.
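To make the relationship explicit, here is a small R sketch that reproduces those numbers; the matrix below is reconstructed from the counts quoted in the arithmetic above (rows are true classes, columns are OOB predictions):

cm <- matrix(c(1699, 180,
                 63,  58),
             nrow = 2, byrow = TRUE,
             dimnames = list(true = c("Y0", "Y1"), predicted = c("Y0", "Y1")))
class.error <- 1 - diag(cm) / rowSums(cm)    # per-class error rates: 0.0958, 0.5207
oob.error   <- 1 - sum(diag(cm)) / sum(cm)   # overall OOB error rate: 0.1215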


A strange case of singular fit in lme4 glmer - simple random structure with variation among groups

Background
I am trying to test for differences in wind speed data among different groups. For the purpose of my question, I am looking only at side wind (wind direction 90° from the individual), and I only care about the strength of the wind, so I use absolute values. The range of wind speeds is 0.0004-6.8 m/sec, and because I use absolute values, a Gamma distribution describes the data much better than a normal distribution.
My data contain 734 samples from 68 individuals, with each individual having between 1 and 30 repeats. However, even if I reduce the data to individuals with at least 10 repeats (which leaves me with 26 individuals and a total of 466 samples), I still get the problematic message.
The model
The full model is Wind ~ a*b + c*d + (1|individual), but for the purpose of this question, the simple model of Wind ~ 1 + (1|individual) gives the same singularity error, so I do not think that the explanatory variables are the problem.
The complete code line is glmer(Wind ~ 1 + (1|individual), data = X, family = Gamma(log))
The problem and the strange part
When running the model, I get the boundary (singular) fit: see ?isSingular message, even though, as you can see, I use a very simple model and random structure. The strange part is that I can avoid this by adding 0.1 to the Wind variable (i.e. glmer(Wind + 0.1 ~ 1 + (1|Tag), data = X, family = Gamma(log)) does not give any warning). I honestly do not remember why I added 0.1 the first time I did it, but I was surprised to see that it made the message go away.
The question
Is this a problem with lme4? Am I missing something? Any ideas what might cause this, and why adding 0.1 to the variable makes it go away?
Edit following questions
I am not sure what the best way to share the data is, so here is a link to a csv file on Google Drive.
Using glmmTMB with the basic formula glmmTMB(Wind ~ 1 + (1|Tag), data = X, family = Gamma(log)) does not produce any warnings, but the full model (i.e., Wind ~ a*b + c*d + (1|individual)) gives a convergence warning ('non-positive-definite Hessian matrix'), which is resolved if I scale the continuous variables.
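For reference, a minimal sketch of the fits described above (hedged: the data are not shown here, the data frame is assumed to be X, and Tag/individual are assumed to be the same grouping factor):

library(lme4)
library(glmmTMB)

## the simple model that triggers the singular-fit message
m.lmer <- glmer(Wind ~ 1 + (1 | individual), data = X, family = Gamma(link = "log"))
isSingular(m.lmer)   # TRUE when the among-individual variance is estimated as ~0
VarCorr(m.lmer)      # inspect the estimated random-effect standard deviation

## the same model in glmmTMB, which reportedly fits without a warning
m.tmb <- glmmTMB(Wind ~ 1 + (1 | individual), data = X, family = Gamma(link = "log"))
summary(m.tmb)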

mgcv bam() error: cannot allocate vector of size 99.6 Gb

I am trying to fit an additive mixed model using bam (mgcv library). My dataset has 10^6 observations from a longitudinal study on growth in 2 × 10^5 children nested in 300 health centers. I am looking for the slope for each center.
The model is
bam(haz ~ s(month, bs = "cc", k = 12) + sex + s(age) + center + year +
      year*center + s(child, bs = "re"), data = data)
Whenever I try to fit the model, the following error message appears:
Error: cannot allocate vector of size 99.6 Gb
In addition: Warning message:
In matrix(by, n, q) : data length exceeds size of matrix
I am working on a cluster with 500 GB of RAM.
Thank you for any help.
To diagnose more precisely where the problem is, try fitting your model with various terms left out. There are several terms in the model that could blow up on you:
- The fixed effects involving center will blow up to 300 columns * 10^6 rows; depending on whether year is numeric or a factor, the year*center term could blow up to 600 columns or (nyears*300) columns.
- It's not clear to me whether bam uses sparse matrices for s(., bs = "re") terms; if not, you'll be in big trouble (2*10^5 columns * 10^6 rows).
As an order of magnitude, a vector of 10^6 numeric values (one column of your model matrix) takes about 7.6 MB, so 500 GB / 7.6 MB would be approximately 65,000 columns ...
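A quick sketch of that arithmetic in R (not from the original answer):

## one numeric model-matrix column of 10^6 rows
print(object.size(numeric(1e6)), units = "Mb")   # ~7.6 Mb
## roughly how many such columns fit in 500 GB of RAM
500 * 1024 / 7.6                                 # ~67,000 columns, the order of magnitude above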
Just taking a guess here, but I would try out the gamm4 package. It's not specifically geared for low-memory use, but:
‘gamm4’ is most useful when the random effects are not i.i.d., or
when there are large numbers of random coeffecients [sic] (more than
several hundred), each applying to only a small proportion of the
response data.
I would also make most of the terms into random effects (note that gamm4 takes lme4-style random-effect terms through its random argument rather than in the main formula):
gamm4::gamm4(haz ~ s(month, bs = "cc", k = 12) + sex + s(age),
             random = ~ (1 | center) + (1 | year) + (1 | year:center) + (1 | child),
             data = data)
or, if there are not very many years in the data set, treat year as a fixed effect:
gamm4::gamm4(haz ~ s(month, bs = "cc", k = 12) + sex + s(age) + year,
             random = ~ (1 | center) + (1 | year:center) + (1 | child),
             data = data)
If there are a small number of years then (year|center) might make sense, to assess among-center variation and covariation among years ... if there are many years, consider making it a smooth term instead ...
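A hedged sketch of that last suggestion, under the same assumptions as above (variable names from the question, year a factor with only a few levels, random effects passed through gamm4's random argument):

gamm4::gamm4(haz ~ s(month, bs = "cc", k = 12) + sex + s(age) + year,
             random = ~ (year | center) + (1 | child),
             data = data)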

Finding model predictor values that maximize the outcome

How do you find the set of values for the model predictors (a mixture of linear and non-linear terms) that yields the highest value of the response?
Example Model:
library(lme4); library(splines)
summary(lmer(formula = Solar.R ~ 1 + bs(Ozone) + Wind + Temp + (1 | Month), data = airquality, REML = F))
Here I am interested in which conditions (predictor values) produce the highest solar radiation (outcome).
This question seems simple, but I've failed to find a good answer using Google.
If the model was simple, I could take the derivatives to find the maximum or minimum. Someone has suggested that if the model function can be extracted, the stats::optim() function might be used. As a last resort, I could simulate all the reasonable variations of input values and plug it into the predict() function and look for the maximum value.
The last approach mentioned doesn't seem very efficient and I imagine that this is a common enough task (e.g., finding optimal customers for advertising) that someone has built some tools for handling it. Any help is appreciated.
There are some conceptual issues here.
- For the simple terms (Wind and Temp), the response is a linear (and hence both monotonic and unbounded) function of the predictors. Thus if these terms have positive parameter estimates, increasing their values to infinity (Inf) will give you an infinite response (Solar.R); values should be as small as possible (negatively infinite) if the coefficients are negative. Practically speaking, then, you want to set these predictors to the minimum or maximum reasonable value if the parameter estimates are respectively negative or positive.
- For the bs term, I'm not sure what the properties of the B-spline are beyond the boundary knots, but I'm pretty sure that the curves go off to positive or negative infinity, so you've got the same issue. However, for the case of bs, it's also possible that there are one or more interior maxima. For this case I would probably try to extract the basis terms and evaluate the spline over the range of the data ...
Alternatively, your mentioning optim makes me think that this is a possibility:
data(airquality)
library(lme4)
library(splines)
m1 <- lmer(formula = Solar.R ~ 1 + bs(Ozone) + Wind + Temp + (1 | Month),
           data = airquality, REML = FALSE)
predval <- function(x) {
    newdata <- data.frame(Ozone = x[1], Wind = x[2], Temp = x[3])
    ## return population-averaged prediction (no Month effect)
    return(predict(m1, newdata = newdata, re.form = ~0))
}
aq <- na.omit(airquality)
sval <- with(aq, c(mean(Ozone), mean(Wind), mean(Temp)))
predval(sval)
opt1 <- optim(fn = predval,
              par = sval,
              lower = with(aq, c(min(Ozone), min(Wind), min(Temp))),
              upper = with(aq, c(max(Ozone), max(Wind), max(Temp))),
              method = "L-BFGS-B",           ## for constrained optimization
              control = list(fnscale = -1))  ## for maximization
## opt1
## $par
## [1] 70.33851 20.70000 97.00000
##
## $value
## [1] 282.9784
As expected, the optimum is intermediate in the range of Ozone (1-168), and at the maximum for Wind (2.3-20.7) and Temp (57-97).
This brute force solution could be made much more efficient by automatically selecting the min/max values for the simple terms and optimizing only over the complex (polynomial/spline/etc.) terms.
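For instance, here is a hedged sketch of that shortcut (it reuses m1 and aq from above, and assumes, as the optim() result suggests, that the Wind and Temp effects are increasing, so both are pinned at their observed maxima):

predval.oz <- function(oz) {
    ## population-averaged prediction with the simple terms fixed at their maxima
    predict(m1,
            newdata = data.frame(Ozone = oz, Wind = max(aq$Wind), Temp = max(aq$Temp)),
            re.form = ~0)
}
opt2 <- optimize(predval.oz, interval = range(aq$Ozone), maximum = TRUE)
## opt2$maximum should be close to the Ozone value (~70) found by optim() above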

Proportion modeling - Betareg errors

I wonder if somebody here can help me.
I am trying to fit a beta GLM with the betareg package, since my dependent variable is a proportion (relative density of whales at a 500 m grid size) varying from 0 to 1. I have three covariates:
Depth (measured in meters, ranging from 4 to 100 m),
Distance to coast (measured in meters, ranging from 0 to 21346 m), and
Distance to boats (measured in meters, ranging from 0 to 20621 m).
My dependent variable has a lot of 0s and many values that are very close to 0 (e.g. 7.8e-14). When I try to fit the model, the following error appears:
invalid dependent variable, all observations must be in (0, 1).
From what I found in previous discussions, it seems this is caused by the 0s in my dataset (I should not have any 0s or 1s). When I change all my 0s to tiny positive values (e.g. 0.0000000000000001), the error message I get is:
Error in chol.default(K) :
the leading minor of order 2 is not positive definite
In addition: Warning messages:
1: In digamma(mu * phi) : NaNs produced
2: In digamma(phi) : NaNs produced
Error in chol.default(K) :
the leading minor of order 2 is not positive definite
In addition: Warning messages:
1: In betareg.fit(X, Y, Z, weights, offset, link, link.phi, type, control) :
failed to invert the information matrix: iteration stopped prematurely
2: In digamma(mu * phi) : NaNs produced
From what I saw in several forums, it seems this is because my matrix is not positive definite. It may be either indefinite (i.e. have both positive and negative eigenvalues), or it may be nearly singular, i.e. its smallest eigenvalue is very close to 0 (and so computationally it is 0).
My question is: since I only have this dataset, is there any way to solve these problems and run a beta regression? Or, is there any other model that I could use instead of betareg package that it could work?
Here is my code:
betareg(Density ~ DEPTH + DISTANCE_TO_COAST + DIST_BOAT, data = misti)
When I change all my 0s to tiny positive values (e.g. 0.0000000000000001)
Doing this seems like a bad idea, resulting in the error messages you see.
It seems that betareg currently only works for data strictly inside the (0, 1) interval, and here's what the package vignette has to say:
The class of beta regression models, as introduced by Ferrari and Cribari-Neto (2004), is useful for modeling continuous variables y that assume values in the open standard unit interval (0, 1). [...] Furthermore, if y also assumes the extremes 0 and 1, a useful transformation in practice is (y · (n − 1) + 0.5)/n where n is the sample size (Smithson and Verkuilen 2006).
So one way to approach this would be:
y.transf.betareg <- function(y) {
    n.obs <- sum(!is.na(y))
    (y * (n.obs - 1) + 0.5) / n.obs
}
betareg(y.transf.betareg(Density) ~ DEPTH + DISTANCE_TO_COAST + DIST_BOAT, data = misti)
For an alternative approach to betareg, using a binomial GLM with a logit link, see this question on Cross Validated and the linked UCLA FAQ:
How to replicate Stata's robust glm for proportion data in R?
Some will suggest using a quasibinomial GLM instead to model proportions/percentages...
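Not from the original answer, but a minimal sketch of that quasibinomial idea, using the variable names from the question (if the denominator behind each proportion were known, it could also be supplied via weights):

fit.qb <- glm(Density ~ DEPTH + DISTANCE_TO_COAST + DIST_BOAT,
              family = quasibinomial(link = "logit"), data = misti)
summary(fit.qb)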
Instead of a beta regression, you can just run a linear model on a logit-type transformation of your dependent variable. Try the following:
## log-odds transform, with a small offset so that zeros do not map to -Inf
logistic <- function(p) log(p / (1 - p) + 0.01)
lm(logistic(Density) ~ DEPTH + DISTANCE_TO_COAST + DIST_BOAT, data = misti)

Cons of setting MaxNWts in R nnet to a very large number

I am using the nnet function from the nnet package in R. I am trying to set the MaxNWts parameter and was wondering if there is any disadvantage to setting this number to a large value like 10^8. The documentation says:
"The maximum allowable number of weights. There is no intrinsic limit
in the code, but increasing MaxNWts will probably allow fits that are
very slow and time-consuming."
I also calculate the size parameter with the following formula (placeholder names):
size <- sqrt(num.input.nodes * num.output.nodes)
The problem is that if I set MaxNWts to a value like 10000, it sometimes fails because the number of coefficients is > 10000 when working with huge data sets.
EDIT
Is there a way to calculate the number of weights (so as to get the same number that nnet calculates) so that I can explicitly set MaxNWts every time without worrying about these failures?
Suggestions?
This is what I have seen in Applied Predictive Modeling:
Kuhn M, Johnson K. Applied Predictive Modeling. 1st edition. New York: Springer; 2013.
For the MaxNWts= argument you can pass any one of:
5 * (ncol(predictors) + 1) + 5 + 1
or
10 * (ncol(predictors) + 1) + 10 + 1
or
13 * (ncol(predictors) + 1) + 13 + 1
predictors is the matrix of your predictors
I think it is empirical, based on your data; it works like a regularization term, similar in spirit to shrinkage methods such as the ridge regression (L2) penalty. Its main goal is to prevent the model from overfitting, which is common with neural networks because of the over-parameterization of their computation.
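To address the edit above without trial and error: for a standard single-hidden-layer nnet with no skip connections, the number of weights is (p + 1) * size + (size + 1) * k, where p is the number of inputs, size the number of hidden units and k the number of output units; the formulas quoted above are the special case k = 1. A hedged sketch (the helper below is hypothetical, not part of nnet):

n.weights <- function(p, size, k = 1) {
    ## weight count of a p-size-k network: input-to-hidden plus hidden-to-output,
    ## each layer including one bias unit
    (p + 1) * size + (size + 1) * k
}
n.weights(p = ncol(predictors), size = 5)   # equals 5 * (ncol(predictors) + 1) + 5 + 1
## then set MaxNWts to at least this value, e.g. (target assumed to be an indicator matrix)
## nnet(x = predictors, y = target, size = 5,
##      MaxNWts = n.weights(ncol(predictors), 5, ncol(target)))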
