pool.compare generates non-conformable arguments error - r

Alternate title: Model matrix and set of coefficients show different numbers of variables
I am using the mice package for R to do some analyses. I wanted to compare two models (held in mira objects) using pool.compare(), but I keep getting the following error:
Error in model.matrix(formula, data) %*% coefs : non-conformable arguments
The binary operator %*% indicates matrix multiplication in R.
The expression model.matrix(formula, data) produces "The design matrix for a regression-like model with the specified formula and data" (from the R Documentation for model.matrix {stats}).
In the error message, coefs is drawn from est1$qbar, where est1 is a mipo object, and the qbar element is "The average of complete data estimates. The multiple imputation estimate." (from the documentation for mipo-class {mice}).
In my case:
est1$qbar is a numeric vector of length 36.
data is a data.frame with 918 observations of 82 variables.
formula is of class 'formula' and contains the formula for my model.
model.matrix(formula, data) is a matrix with dimensions 918 x 48.
How can I resolve/prevent this error?

As occasionally happens, I found the answer to my own question while writing the question.
The clue was that the estimates for categorical variables in est1$qbar only exist if that level of that variable was present in the data. Some of my variables are factor variables in which not every level is represented. This caused the warning "contrasts dropped from factor <variable name> due to missing levels", which I foolishly ignored.
On the other hand, looking at dimnames(model.matrix.temp)[[2]] shows that the model matrix has one column for each level of each factor variable, regardless of whether that level of that variable was present in the data. So, although the contrasts for missing factor levels are dropped when estimating the coefficients, those factor levels still appear in the model matrix. This means the model matrix has more columns than the length of est1$qbar (the vector of estimated coefficients), so matrix multiplication is not going to work.
The answer here is to fix the factor variables so that there are no unused levels. This can be done with the factor() function (as explained here). Unfortunately, this needs to be done on the original dataset, prior to imputation.
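For example (a minimal sketch; the data frame, column names, and model formulas below are hypothetical stand-ins for my actual analysis), the unused levels can be dropped before running mice:

library(mice)
# Drop unused levels from every factor column of the original data, so the
# pooled coefficients and the model matrix describe the same set of columns.
mydata <- droplevels(mydata)
# or, for a single column: mydata$group <- factor(mydata$group)
imp <- mice(mydata, m = 5, seed = 123)
fit1 <- with(imp, glm(outcome ~ group + age, family = binomial))
fit0 <- with(imp, glm(outcome ~ age, family = binomial))
pool.compare(fit1, fit0)  # the non-conformable arguments error should be gone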

Related

Removing variables resulting in singular matrix in R regression model

I've been using mnlogit in R to generate a multivariable logistic regression model. My original set of variables generated a singular matrix error, i.e.
Error in solve.default(hessian, gradient, tol = 1e-24) :
system is computationally singular: reciprocal condition number = 7.09808e-25
It turns out that several "sparse" columns (variables that are 0 for most sampled individuals) cause this singularity error. I need a systematic way of removing those variables that lead to a singularity error while retaining those that allow estimation of a regression model, i.e. something analogous to the use of the function step to select variables minimizing AIC via stepwise addition, but this time removing variables that generate singular matrices.
Is there some way to do this, since checking each variable by hand (there are several hundred predictor variables) would be incredibly tedious?
If X is the design matrix from your model, which you can obtain using
X <- model.matrix(formula, data = data)
then you can find a (non-unique) set of variables that would give you a non-singular model using the QR decomposition. For example,
x <- 1:3
X <- model.matrix(~ x + I(x^2) + I(x^3))
QR <- qr(crossprod(X)) # Get the QR decomposition
vars <- QR$pivot[seq_len(QR$rank)] # Variable numbers
names <- rownames(QR$qr)[vars] # Variable names
names
#> [1] "(Intercept)" "x" "I(x^2)"
This is subject to numerical error and may not agree with whatever code you are using, for two reasons.
First, it doesn't do any weighting, whereas logistic regression normally uses iteratively reweighted least squares.
Second, it might not use the same tolerance as the other code. You can change its sensitivity by changing the tol parameter to qr() from the default 1e-07. Bigger values will cause more variables to be omitted from names.
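If it helps, the retained columns can then be used to refit a reduced model. The sketch below uses glm() with a binomial family as a stand-in for whatever model you are actually fitting, and y stands for your response vector:

X <- model.matrix(formula, data = data)  # full design matrix
QR <- qr(crossprod(X))
keep <- QR$pivot[seq_len(QR$rank)]       # linearly independent columns
X_reduced <- X[, keep, drop = FALSE]
# Refit on the reduced design matrix; '- 1' because X_reduced already
# contains the intercept column.
fit <- glm(y ~ X_reduced - 1, family = binomial)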

Why convert numbers to factors while model building

I was following a tutorial on model building using logistic regression.
In the tutorial, columns with a numeric data type that had 3 levels were converted into factors using the as.factor function. I wanted to know the reason for this conversion.
If vectors of class "numeric" with a small number of unique values are left in that form, logistic regression, i.e. glm(form, family="binomial", ...), will return a single coefficient. Generally, that is not what the data will support, so the authors of that tutorial are advising that these vectors be converted to factors so that glm's default handling of categorical values will occur. It is possible that those authors know for a fact that the underlying data-gathering process encoded categorical data with numeric levels and that the data input step was not "told" to treat them as categorical. That could have been done using the colClasses parameter of whichever read.* function was used.
The default handling of factors by most R regression routines uses the first level as part of the baseline (Intercept) estimate and estimates a coefficient for each of the other levels. If you had left that vector as numeric, you would have gotten an estimate that could be interpreted as the slope of an effect of an ordinal variable. The statistical test associated with such an encoding of an ordinal relationship is often called a "linear test of trend", and it is sometimes a useful result when the data situation in the "real world" can be interpreted as an ordinal relationship.
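A small illustration of the difference with simulated data (not from the tutorial):

set.seed(1)
d <- data.frame(x = sample(1:3, 200, replace = TRUE))  # 3-level predictor stored as numeric
d$y <- rbinom(200, 1, prob = c(0.2, 0.5, 0.6)[d$x])    # binary outcome
coef(glm(y ~ x, data = d, family = binomial))          # one slope: a linear trend in x
coef(glm(y ~ factor(x), data = d, family = binomial))  # one coefficient per non-baseline level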

random forest: error in dealing with factor levels in R

I am using a random forest (rf) model in R to predict a binary outcome, 0 or 1. I have categorical variables (coded as numbers) in my input data which are coded as factors while training. I use the factor() function in R to convert each variable to a factor. So for every categorical variable x, my code looks like this:
feature_x1=factor(feature_x1) # Convert the variable into factor in training data.
#This variable takes 3 levels 0,1,2
This works perfectly fine while training the model. Let us assume my model object is rf_model. When running the model on new data, which is just a vector of numbers, I first convert the numbers into factors for feature_x1:
newdata=data.frame(1,2)
colnames(newdata)=c("feature_x1","feature_x2")
newdata$feature_x1=factor(newdata$feature_x1)
score=predict(rf_model,newdata,type="prob")
I am receiving the following error
Error in predict.randomForest(rf_model, newdata,type = "prob") :
New factor levels not present in the training data
How do I deal with this error? In reality, after training the model we will always have to deal with data for which the outcome is unknown, which may be just a single record.
Please let me know if more clarity or code is required
Try
newdata$feature_x1 <- factor(newdata$feature_x1, levels=levels(feature_x1))
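If there are several factor columns, the same idea can be applied to each of them. This sketch assumes the training data frame is still available under the placeholder name traindata:

# Give every factor column in the scoring data the level set it had during
# training, then predict as usual.
for (col in names(newdata)) {
  if (col %in% names(traindata) && is.factor(traindata[[col]])) {
    newdata[[col]] <- factor(newdata[[col]], levels = levels(traindata[[col]]))
  }
}
score <- predict(rf_model, newdata, type = "prob")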

interpret/extract coefficients from factor variable in glmnet

I have run a logit model through glmnet. I am extracting the coefficients at the minimum lambda, and it gives me the results I expect. However, I have a factor variable with nine unique values, and glmnet produces a single coefficient for it, which is expected for a binary variable but not for a factor...
library(glmnet)
coef(model.obj, s = 'lambda.min')
#output:
TraumaticInj 2.912419e-02
Toxin .
OthInj 4.065266e-03
CurrentSTDYN 7.601812e-01
GeoDiv 1.372628e-02 #this is a factor variable w/ 9 options...
so my questions:
1) how should I interpret a single coefficient from a factor variable in glmnet?
2) is there a method to extract the coefficients for the different factors of the variable?
glmnet doesn't handle factor variables. You have to convert them to dummies using, e.g., model.matrix(). So the result you are seeing is glmnet treating your factor variable as a single real-valued variable.
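A hedged sketch of that dummy conversion (df and the outcome column y are placeholder names; the predictor names are taken from the coefficient output above):

library(glmnet)
# Expand factors into dummy columns; drop the intercept column model.matrix adds.
x <- model.matrix(~ TraumaticInj + Toxin + OthInj + CurrentSTDYN + GeoDiv, data = df)[, -1]
cvfit <- cv.glmnet(x, df$y, family = "binomial")
coef(cvfit, s = "lambda.min")  # GeoDiv now appears as GeoDiv2, GeoDiv3, ... with separate coefficients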
It can't be done, because glmnet doesn't handle factor variables. This is pretty much answered here: How does glmnet's standardize argument handle dummy variables?
This comment by @R_User in the answer is particularly insightful:
@DTRM - In general, one does not standardize categorical variables to
retain the interpretability of the estimated regressors. However, as
pointed out by Tibshirani here:
statweb.stanford.edu/~tibs/lasso/fulltext.pdf, "The lasso method
requires initial standardization of the regressors, so that the
penalization scheme is fair to all regressors. For categorical
regressors, one codes the regressor with dummy variables and then
standardizes the dummy variables" - so while this causes arbitrary
scaling between continuous and categorical variables, it's done for
equal penalization treatment. – R_User Dec 6 '13 at 1:20

set random forest to classification

I am attempting a random forest on some data where the class variables is binary (either 1 or 0). Here is the code I'm running:
forest.model <- randomForest(x = ticdata2000[, 1:85], y = ticdata2000[, 86],
                             ntree = 500,
                             mtry = 9,
                             importance = TRUE,
                             norm.votes = TRUE,
                             na.action = na.roughfix,
                             replace = FALSE)
But when the forest gets to the end, I get the following error:
Warning message:
In randomForest.default(x = ticdata2000[, 1:85], y = ticdata2000[, :
The response has five or fewer unique values. Are you sure you want to do regression?
The answer, of course, is no. I don't want to do regression. I have a single, discrete variable that only takes on 2 classes. And, of course, when I run predictions with this model, I get continuous numbers when I want a list of zeroes and ones. Can someone tell me what I'm doing wrong that makes this run as regression rather than classification?
Change your response column to a factor using as.factor (or just factor). Since you've stored that variable as numeric 0's and 1's, R rightly interprets it as a numeric variable. If you want R to treat it differently, you have to tell it so.
This is mentioned in the documentation under the y argument:
A response vector. If a factor, classification is assumed, otherwise
regression is assumed. If omitted, randomForest will run in
unsupervised mode.
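A minimal sketch of that fix using the objects from the question (some of the original arguments are omitted for brevity):

library(randomForest)
# Convert the 0/1 response to a factor so randomForest runs in classification mode.
ticdata2000[, 86] <- as.factor(ticdata2000[, 86])
forest.model <- randomForest(x = ticdata2000[, 1:85],
                             y = ticdata2000[, 86],
                             ntree = 500, mtry = 9,
                             importance = TRUE, replace = FALSE)
predict(forest.model)                 # predicted class labels (0/1)
predict(forest.model, type = "prob")  # class membership probabilities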

Resources