set random forest to classification - r

I am attempting a random forest on some data where the class variable is binary (either 1 or 0). Here is the code I'm running:
forest.model <- randomForest(x = ticdata2000[, 1:85],
                             y = ticdata2000[, 86],
                             ntree = 500,
                             mtry = 9,
                             importance = TRUE,
                             norm.votes = TRUE,
                             na.action = na.roughfix,
                             replace = FALSE)
But when the forest finishes, I get the following warning:
Warning message:
In randomForest.default(x = ticdata2000[, 1:85], y = ticdata2000[, :
The response has five or fewer unique values. Are you sure you want to do regression?
The answer, of course, is no: I don't want to do regression. I have a single discrete variable that takes on only 2 classes. And indeed, when I run predictions with this model, I get continuous numbers when I want a list of zeroes and ones. Can someone tell me what I'm doing wrong that makes this run regression instead of classification?

Change your response column to a factor using as.factor (or just factor). Since you've stored that variable as numeric 0's and 1's, R rightly interprets it as a numeric variable. If you want R to treat it differently, you have to tell it so.
This is mentioned in the documentation under the y argument:
A response vector. If a factor, classification is assumed, otherwise
regression is assumed. If omitted, randomForest will run in
unsupervised mode.
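For example, a minimal fix for the call in the question (assuming, as there, that the response sits in column 86 of ticdata2000) would be:
ticdata2000[, 86] <- as.factor(ticdata2000[, 86])  # 0/1 stored as a factor => classification
forest.model <- randomForest(x = ticdata2000[, 1:85],
                             y = ticdata2000[, 86],
                             ntree = 500,
                             mtry = 9,
                             importance = TRUE,
                             norm.votes = TRUE,
                             na.action = na.roughfix,
                             replace = FALSE)
predict(forest.model, type = "response")  # now returns the factor levels "0"/"1" (OOB predictions here)
predict(forest.model, type = "prob")      # or a matrix of class probabilities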

Related

Calculating VIF for ordinal logistic regression & multicollinearity in R

I am running an ordinal regression model. I have 8 explanatory variables, 4 of them categorical ('0' or '1'), 4 of them continuous. Beforehand I want to be sure there's no multicollinearity, so I use the variance inflation factor (vif function from the car package):
mod1<-polr(Y ~ X1+X2+X3+X4+X5+X6+X7+X8, Hess = T, data=df)
vif(mod1)
but I get a VIF value of 125 for one of the variables, as well as the following warning:
Warning message: In vif.default(mod1) : No intercept: vifs may not be sensible.
However, when I convert my dependent variable to numeric (instead of a factor) and do the same thing with a linear model:
mod2<-lm(Y ~ X1+X2+X3+X4+X5+X6+X7+X8, data=df)
vif(mod2)
This time all the VIF values are below 3, suggesting that there's no multicollinearity.
I am confused about the vif function. How can it return VIFs > 100 for one model and low VIFs for another? Should I stick with the second result and still do an ordinal model anyway?
The vif() function uses determinants of the correlation matrix of the parameters (and subsets thereof) to calculate the VIF. In a linear model, this includes just the regression coefficients, excluding the intercept.

The vif() function wasn't intended to be used with ordered logit models. So, when it finds the variance-covariance matrix of the parameters, it includes the threshold parameters (i.e., intercepts), which would normally be excluded by the function in a linear model. This is why you get the warning you get: it doesn't know to look for threshold parameters and remove them.

Since the VIF is really a function of the inter-correlations in the design matrix (which doesn't depend on the dependent variable or the non-linear mapping from the linear predictor into the space of the response variable, i.e. the link function in a glm), you should get the right answer with your second approach above, using lm() with a numeric version of your dependent variable.
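A minimal sketch of that workaround, using the variable names from the question (Y is assumed to be the ordered factor passed to polr, converted here to its numeric level codes):
library(car)
# VIFs depend only on the correlations among the predictors, so a throwaway
# linear model on a numeric copy of the response gives usable values
mod_vif <- lm(as.numeric(Y) ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8, data = df)
vif(mod_vif)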

Removing variables resulting in singular matrix in R regression model

I've been using mnlogit in R to generate a multivariable logistic regression model. My original set of variables generated a singular matrix error, i.e.
Error in solve.default(hessian, gradient, tol = 1e-24) :
system is computationally singular: reciprocal condition number = 7.09808e-25
It turns out that several "sparse" columns (variables that are 0 for most sampled individuals) cause this singularity error. I need a systematic way of removing those variables that lead to a singularity error while retaining those that allow estimation of a regression model, i.e. something analogous to the use of the function step to select variables minimizing AIC via stepwise addition, but this time removing variables that generate singular matrices.
Is there some way to do this, since checking each variable by hand (there are several hundred predictor variables) would be incredibly tedious?
If X is the design matrix from your model which you can obtain using
X <- model.matrix(formula, data = data)
then you can find a (non-unique) set of variables that would give you a non-singular model using the QR decomposition. For example,
x <- 1:3
X <- model.matrix(~ x + I(x^2) + I(x^3))
QR <- qr(crossprod(X)) # Get the QR decomposition
vars <- QR$pivot[seq_len(QR$rank)] # Variable numbers
names <- rownames(QR$qr)[vars] # Variable names
names
#> [1] "(Intercept)" "x" "I(x^2)"
This is subject to numerical error and may not agree with whatever code you are using, for two reasons.
First, it doesn't do any weighting, whereas logistic regression normally uses iteratively reweighted regression.
Second, it might not use the same tolerance as the other code. You can change its sensitivity by changing the tol parameter to qr() from the default 1e-07. Bigger values will cause more variables to be omitted from names.
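As an illustration of how this could be strung together for the original problem (f and dat below are placeholder names for your formula and data frame, and the reformulate() step assumes all predictors are plain numeric columns so that model-matrix column names match variable names):
X  <- model.matrix(f, data = dat)
QR <- qr(crossprod(X), tol = 1e-7)                      # raise tol to drop more near-singular columns
keep    <- rownames(QR$qr)[QR$pivot[seq_len(QR$rank)]]  # identifiable columns
dropped <- setdiff(colnames(X), keep)                   # columns causing the singularity
f_reduced <- reformulate(setdiff(keep, "(Intercept)"), response = all.vars(f)[1])
# refit the model using f_reduced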

Why convert numbers to factors while model building

I was following a tutorial on model building using logistic regression.
In the tutorial, columns with a numeric data type and 3 levels were converted into factors using the as.factor function. I wanted to know the reason for this conversion.
If vectors of class "numeric" with a small number of unique values are left in that form, logistic regression, i.e. glm(form, family = "binomial", ...), will return a single coefficient. Generally, that is not what the data will support, so the authors of that tutorial are advising that these vectors be converted to factors so that the default handling of categorical values by the glm function will occur. It's possible that those authors already know for a fact that the underlying data-gathering process has encoded categorical data with numeric levels and that the data input process was not "told" to treat them as categorical. That could have been done using the colClasses parameter to whichever read.* function was used.
The default handling of factors by most R regression routines uses the first level as part of the baseline (Intercept) estimate and estimates a coefficient for each of the other levels. If you had left that vector as numeric, you would have gotten an estimate that could be interpreted as the slope of an effect of an ordinal variable. The statistical test associated with such an encoding of an ordinal relationship is often called a "linear test of trend" and is sometimes a useful result when the data situation in the "real world" can be interpreted as an ordinal relationship.
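A small self-contained illustration of the difference (the data here are simulated just for the example):
set.seed(1)
x <- sample(0:2, 200, replace = TRUE)                # a 3-level variable stored as numbers
y <- rbinom(200, 1, prob = c(0.2, 0.7, 0.4)[x + 1])  # outcome is not monotone in x
coef(glm(y ~ x,         family = "binomial"))  # one slope: forces a linear trend in x
coef(glm(y ~ factor(x), family = "binomial"))  # one coefficient per non-reference level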

pool.compare generates non-conformable arguments error

Alternate title: Model matrix and set of coefficients show different numbers of variables
I am using the mice package for R to do some analyses. I wanted to compare two models (held in mira objects) using pool.compare(), but I keep getting the following error:
Error in model.matrix(formula, data) %*% coefs : non-conformable arguments
The binary operator %*% indicates matrix multiplication in R.
The expression model.matrix(formula, data) produces "The design matrix for a regression-like model with the specified formula and data" (from the R Documentation for model.matrix {stats}).
In the error message, coefs is drawn from est1$qbar, where est1 is a mipo object, and the qbar element is "The average of complete data estimates. The multiple imputation estimate." (from the documentation for mipo-class {mice}).
In my case
est1$qbar is a numeric vector of length 36
data is a data.frame with 918 observations of 82 variables
formula is class 'formula' containing the formula for my model
model.matrix(formula, data) is a matrix with dimension 918 x 48.
How can I resolve/prevent this error?
As occasionally happens, I found the answer to my own question while writing the question.
The clue was that the estimates for categorical variables in est1$qbar only exist if that level of that variable was present in the data. Some of my variables are factor variables where not every level is represented. This caused the warning "contrasts dropped from factor <variable name> due to missing levels", which I foolishly ignored.
On the other hand, looking at dimnames(model.matrix.temp)[[2]] shows that the model matrix has one column for each level of each factor variable, regardless of whether that level of that variable was present in the data. So, although the contrasts for missing factor levels are dropped in terms of estimating the coefficients, those factor levels still appear in the model matrix. This means that the model matrix has more columns than the length of est1$qbar (the vector of estimated coefficients), so the matrix multiplication is not going to work.
The answer here is to fix the factor variables so that there are no unused levels. This can be done with the factor() function. Unfortunately, this needs to be done on the original dataset, prior to imputation.
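A sketch of that cleanup step, assuming the pre-imputation data frame is called dat (a placeholder name):
dat <- droplevels(dat)            # drops unused levels from every factor column
# or, for a single column:
dat$myfactor <- factor(dat$myfactor)
Re-running mice() on the cleaned data frame should then leave qbar and the model matrix with matching numbers of columns.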

random forest: error in dealing with factor levels in R

I am using a random forest (rf) model in R to predict a binary outcome (0 or 1). I have categorical variables (coded as numbers) in my input data, which are converted to factors while training. I use the factor() function in R to do the conversion. So for every categorical variable x, my code looks like this:
feature_x1 = factor(feature_x1)  # convert the variable into a factor in the training data
# this variable takes 3 levels: 0, 1, 2
This works perfectly fine while training the model. Let us assume my model object is rf_model. When running the model on new data, which is just a vector of numbers, I first convert the numbers into factors for feature_x1:
newdata=data.frame(1,2)
colnames(newdata)=c("feature_x1","feature_x2")
newdata$feature_x1=factor(newdata$feature_x1)
score = predict(rf_model, newdata, type = "prob")
I am receiving the following error:
Error in predict.randomForest(rf_model, newdata,type = "prob") :
New factor levels not present in the training data
How do I deal with this error? In reality, after training the model, we will always have to score new data for which the outcome is unknown, often just a single record.
Please let me know if more clarity or code is required.
Try
newdata$feature_x1 <- factor(newdata$feature_x1, levels=levels(feature_x1))
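That forces the new data to use the level set seen in training rather than inventing its own. If there are several factor columns, the same idea can be applied to each one, taking the levels from the training data; a sketch, assuming the training columns live in a data frame called train (a placeholder name):
for (v in c("feature_x1", "feature_x2")) {
  if (is.factor(train[[v]]))
    newdata[[v]] <- factor(newdata[[v]], levels = levels(train[[v]]))
}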
