Error when using predict() on a randomForest object trained with caret's train() using formula - r

Using R 3.2.0 with caret 6.0-41 and randomForest 4.6-10 on a 64-bit Linux machine.
When trying to use the predict() method on a randomForest object trained with the train() function from the caret package using a formula, the function returns an error.
When training via randomForest() and/or using x= and y= rather than a formula, it all runs smoothly.
Here is a working example:
library(randomForest)
library(caret)
data(imports85)
imp85 <- imports85[, c("stroke", "price", "fuelType", "numOfDoors")]
imp85 <- imp85[complete.cases(imp85), ]
imp85[] <- lapply(imp85, function(x) if (is.factor(x)) x[,drop=TRUE] else x) ## Drop empty levels for factors.
modRf1 <- randomForest(numOfDoors~., data=imp85)
caretRf <- train( numOfDoors~., data=imp85, method = "rf" )
modRf2 <- caretRf$finalModel
modRf3 <- randomForest(x=imp85[,c("stroke", "price", "fuelType")], y=imp85[, "numOfDoors"])
caretRf <- train(x=imp85[,c("stroke", "price", "fuelType")], y=imp85[, "numOfDoors"], method = "rf")
modRf4 <- caretRf$finalModel
p1 <- predict(modRf1, newdata=imp85)
p2 <- predict(modRf2, newdata=imp85)
p3 <- predict(modRf3, newdata=imp85)
p4 <- predict(modRf4, newdata=imp85)
Among the last 4 lines, only the second one p2 <- predict(modRf2, newdata=imp85) returns the following error:
Error in predict.randomForest(modRf2, newdata = imp85) :
variables in the training data missing in newdata
It seems that the reason for this error is that the predict.randomForest method uses rownames(object$importance) to determine the name of the variables used to train the random forest object. And when looking at
rownames(modRf1$importance)
rownames(modRf2$importance)
rownames(modRf3$importance)
rownames(modRf4$importance)
We see:
[1] "stroke" "price" "fuelType"
[1] "stroke" "price" "fuelTypegas"
[1] "stroke" "price" "fuelType"
[1] "stroke" "price" "fuelType"
So somehow, when using the caret train() function with a formula changes the name of the (factor) variables in the importance field of the randomForest object.
Is it really an inconsistency between the formula and and non-formula version of the caret train() function? Or am I missing something?

First, almost never use the $finalModel object for prediction. Use predict.train. This is one good example of why.
There is some inconsistency between how some functions (including randomForest and train) handle dummy variables. Most functions in R that use the formula method will convert factor predictors to dummy variables because their models require numerical representations of the data. The exceptions to this are tree- and rule-based models (that can split on categorical predictors), naive Bayes, and a few others.
So randomForest will not create dummy variables when you use randomForest(y ~ ., data = dat) but train (and most others) will using a call like train(y ~ ., data = dat).
The error occurs because fuelType is a factor. The dummy variables created by train don't have the same names so predict.randomForest can't find them.
Using the non-formula method with train will pass the factor predictors to randomForest and everything will work.
TL;DR
Use the non-formula method with train if you want the same levels or use predict.train

There can be two reasons why you get this error.
1. The categories of the categorical variables in the train and test sets don't match. To check that, you can run something like the following.
Well, first of all, it is good practice to keep the independent variables/features in a list. Say that list is "vars". And say, you separated "Data" into "Train" and "Test". Let's go:
for (v in vars){
if (class(Data[,v]) == 'factor'){
print(v)
# print(levels(Train[,v]))
# print(levels(Test[,v]))
print(all.equal(levels(Train[,v]) , levels(Test[,v])))
}
}
Once you find the non-matching categorical variables, you can go back, and impose the categories of Test data onto Train data, and then re-build your model. In a loop similar to above, for each nonMatchingVar, you can do
levels(Test$nonMatchingVar) <- levels(Train$nonMatchingVar)
2. A silly one. If you accidentally leave the dependent variable in the set of independent variables, you may run into this error message. I have done that mistake. Solution: Just be more careful.

Another way is to explicitly code the testing data using model.matrix, e.g.
p2 <- predict(modRf2, newdata=model.matrix(~., imp85))

Related

How can I include both my categorical and numeric predictors in my elastic net model? r

As a note beforehand, I think I should mention that I am working with highly sensitive medical data that is protected by HIPAA. I cannot share real data with dput- it would be illegal to do so. That is why I made a fake dataset and explained my processes to help reproduce the error.
I have been trying to estimate an elastic net model in r using glmnet. However, I keep getting an error. I am not sure what is causing it. The error happens when I go to train the data. It sounds like it has something to do with the data type and matrix.
I have provided a sample dataset. Then I set the outcomes and certain predictors to be factors. After setting certain variables to be factors, I label them. Next, I create an object with the column names of the predictors I want to use. That object is pred.names.min. Then I partition the data into the training and test data frames. 65% in the training, 35% in the test. With the train control function, I specify a few things I want to have happen with the model- random paraments for lambda and alpha, as well as the leave one out method. I also specify that it is a classification model (categorical outcome). In the last step, I specify the training model. I write my code to tell it to use all of the predictor variables in the pred.names.min object for the trainingset data frame.
library(dplyr)
library(tidyverse)
library(glmnet),0,1,0
library(caret)
#creating sample dataset
df<-data.frame("BMIfactor"=c(1,2,3,2,3,1,2,1,3,2,1,3,1,1,3,2,3,2,1,2,1,3),
"age"=c(0,4,8,1,2,7,4,9,9,2,2,1,8,6,1,2,9,2,2,9,2,1),
"L_TartaricacidArea"=c(0,1,1,0,1,1,1,0,0,1,0,1,1,0,1,0,0,1,1,0,1,1),
"Hydroxymethyl_5_furancarboxylicacidArea_2"=
c(1,1,0,1,0,0,1,0,1,1,0,1,1,0,1,1,0,1,0,1,0,1),
"Anhydro_1.5_D_glucitolArea"=
c(8,5,8,6,2,9,2,8,9,4,2,0,4,8,1,2,7,4,9,9,2,2),
"LevoglucosanArea"=
c(6,2,9,2,8,6,1,8,2,1,2,8,5,8,6,2,9,2,8,9,4,2),
"HexadecanolArea_1"=
c(4,9,2,1,2,9,2,1,6,1,2,6,2,9,2,8,6,1,8,2,1,2),
"EthanolamineArea"=
c(6,4,9,2,1,2,4,6,1,8,2,4,9,2,1,2,9,2,1,6,1,2),
"OxoglutaricacidArea_2"=
c(4,7,8,2,5,2,7,6,9,2,4,6,4,9,2,1,2,4,6,1,8,2),
"AminopentanedioicacidArea_3"=
c(2,5,5,5,2,9,7,5,9,4,4,4,7,8,2,5,2,7,6,9,2,4),
"XylitolArea"=
c(6,8,3,5,1,9,9,6,6,3,7,2,5,5,5,2,9,7,5,9,4,4),
"DL_XyloseArea"=
c(6,9,5,7,2,7,0,1,6,6,3,6,8,3,5,1,9,9,6,6,3,7),
"ErythritolArea"=
c(6,7,4,7,9,2,5,5,8,9,1,6,9,5,7,2,7,0,1,6,6,3),
"hpresponse1"=
c(1,0,1,1,0,1,1,0,0,1,0,0,1,0,1,1,1,0,1,0,0,1),
"hpresponse2"=
c(1,0,1,0,0,1,1,1,0,1,0,1,0,1,1,0,1,0,1,0,0,1))
#setting variables as factors
df$hpresponse1<-as.factor(df$hpresponse1)
df$hpresponse2<-as.factor(df$hpresponse2)
df$BMIfactor<-as.factor(df$BMIfactor)
df$L_TartaricacidArea<- as.factor(df$L_TartaricacidArea)
df$Hydroxymethyl_5_furancarboxylicacidArea_2<-
as.factor(df$Hydroxymethyl_5_furancarboxylicacidArea_2)
#labeling factor levels
df$hpresponse1 <- factor(df$hpresponse1, labels = c("group1.2", "group3.4"))
df$hpresponse2 <- factor(df$hpresponse2, labels = c("group1.2.3", "group4"))
df$L_TartaricacidArea <- factor(df$L_TartaricacidArea, labels =c ("No",
"Yes"))
df$Hydroxymethyl_5_furancarboxylicacidArea_2 <-
factor(df$Hydroxymethyl_5_furancarboxylicacidArea_2, labels =c ("No",
"Yes"))
df$BMIfactor <- factor(df$BMIfactor, labels = c("<40", ">=40and<50",
">=50"))
#creating list of predictor names
pred.start.min <- which(colnames(df) == "BMIfactor"); pred.start.min
pred.stop.min <- which(colnames(df) == "ErythritolArea"); pred.stop.min
pred.names.min <- colnames(df)[pred.start.min:pred.stop.min]
#partition data into training and test (65%/35%)
set.seed(2)
n=floor(nrow(df)*0.65)
train_ind=sample(seq_len(nrow(df)), size = n)
trainingset=df[train_ind,]
testingset=df[-train_ind,]
#specifying that I want to use the leave one out cross-
#validation method and
use "random" as search for elasticnet
tcontrol <- trainControl(method = "LOOCV",
search="random",
classProbs = TRUE)
#training model
elastic_model1 <- train(as.matrix(trainingset[,
pred.names.min]),
trainingset$hpresponse1,
data = trainingset,
method = "glmnet",
trControl = tcontrol)
After I run the last chunk of code, I end up with this error:
Error in { :
task 1 failed - "error in evaluating the argument 'x' in selecting a
method for function 'as.matrix': object of invalid type "character" in
'matrix_as_dense()'"
In addition: There were 50 or more warnings (use warnings() to see the first
50)
I tried removing the "as.matrix" arguemtent:
elastic_model1 <- train((trainingset[, pred.names.min]),
trainingset$hpresponse1,
data = trainingset,
method = "glmnet",
trControl = tcontrol)
It still produces a similar error.
Error in { :
task 1 failed - "error in evaluating the argument 'x' in selecting a method
for function 'as.matrix': object of invalid type "character" in
'matrix_as_dense()'"
In addition: There were 50 or more warnings (use warnings() to see the first
50)
When I tried to make none of the predictors factors (but keep outcome as factor), this is the error I get:
Error: At least one of the class levels is not a valid R variable name; This
will cause errors when class probabilities are generated because the
variables names will be converted to X0, X1 . Please use factor levels that
can be used as valid R variable names (see ?make.names for help).
How can I fix this? How can I use my predictors (both the numeric and categorical ones) without producing an error?
glmnet does not handle factors well. The recommendation currently is to dummy code and re-code to numeric where possible:
Using LASSO in R with categorical variables

Using a For Loop to Run Multiple Response Variables through a Train function to create multiple seperate models in R

I am trying to create a for loop to index thorugh each individual response variable I have and train a model using the train() funciton within the Caret Package. I have about 30 response variable and 43 predictor variables. I can train each model individually but I would like to automate the process and have a for loop run through a model (I would like to eventually upscale to multiple models if possible, i.e. lm, rf, cubist, etc.). I then want to save each model to a dataframe along with R-squared values and RMSE values. The individual models that I currenlty have that will run for me goes as follows, with column 11 being the response variable and column 35-68 being predictor variables.
data_Mg <- subset(data_3, !is.na(Mg))
mg.lm <- train(Mg~., data=data_Mg[,c(11,35:68)], method="lm", trControl=control)
mg.cubist <- train(Mg~., data=data_Mg[,c(11,35:68)], method="cubist", trControl=control)
mg.rf <- train(Mg~., data=data_Mg[,c(11,35:68)], method="rf", trControl=control, na.action = na.roughfix)
max(mg.lm$results$Rsquared)
min(mg.lm$results$RMSE)
max(mg.cubist$results$Rsquared)
min(mg.cubist$results$RMSE)
max(mg.rf$results$Rsquared) #Highest R squared
min(mg.rf$results$RMSE)
This gives me 3 models with everything the relevant information that I need. Now for the for loop. I've only tried the lm model so far for this.
bucket <- list()
for(i in 1:ncol(data_4)) { # for-loop response variables, need to end it at response variables, rn will run through all variables
data_y<-subset(data_4, !is.na(i))#get rid of NA's in the "i" column
predictors_i <- colnames(data_4)[i] # Create vector of predictor names
predictors_1.1 <- noquote(predictors_i)
i.lm <- train(predictors_1.1~., data=data_4[,c(i,35:68)], method="lm", trControl=control)
bucket <- i.lm
#mod_summaries[[i - 1]] <- summary(lm(y ~ ., data_y[ , c("i.lm", predictors_i)]))
#data_y <- data_4
}
Below is the error code that I am getting, with Bulk_Densi being the first variable in predictors_1.1. The error code is that variable lengths differ so I originally thought that my issue was that quotes were being added around "Bulk_Densi" but after trying the NoQuote() function I have not gotten anywehre so I am unsure of where I am going wrong.
Error code that I am getting
Please let me know if I can provide any extra info and thanks in advance for the help! I've already tried the info in How to train several models within a loop for and was struggling with that as well.

Extracting predictions from a GAM model with splines and lagged predictors

I have some data and am trying to teach myself about utilize lagged predictors within regression models. I'm currently trying to generate predictions from a generalized additive model that uses splines to smooth the data and contains lags.
Let's say I have the following data and have split the data into training and test samples.
head(mtcars)
Train <- sample(1:nrow(mtcars), ceiling(nrow(mtcars)*3/4), replace=FALSE)
Great, let's train the gam model on the training set.
f_gam <- gam(hp ~ s(qsec, bs="cr") + s(lag(disp, 1), bs="cr"), data=mtcars[Train,])
summary(f_gam)
When I go to predict on the holdout sample, I get an error message.
f_gam.pred <- predict(f_gam, mtcars[-Train,]); f_gam.pred
Error in ExtractData(object, data, NULL) :
'names' attribute [1] must be the same length as the vector [0]
Calls: predict ... predict.gam -> PredictMat -> Predict.matrix3 -> ExtractData
Can anyone help diagnose the issue and help with a solution. I get that lag(__,1) leaves a data point as NA and that is likely the reason for the lengths being different. However, I don't have a solution to the problem.
I'm going to assume you're using gam() from the mgcv library. It appears that gam() doesn't like functions that are not defined in "base" in the s() terms. You can get around this by adding a column which include the transformed variable and then modeling using that variable. For example
tmtcars <- transform(mtcars, ldisp=lag(disp,1))
Train <- sample(1:nrow(mtcars), ceiling(nrow(mtcars)*3/4), replace=FALSE)
f_gam <- gam(hp ~ s(qsec, bs="cr") + s(ldisp, bs="cr"), data= tmtcars[Train,])
summary(f_gam)
predict(f_gam, tmtcars[-Train,])
works without error.
The problem appears to be coming from the mgcv:::get.var function. It tires to decode the terms with something like
eval(parse(text = txt), data, enclos = NULL)
and because they explicitly set the enclosure to NULL, variable and function names outside of base cannot be resolved. So because mean() is in the base package, this works
eval(parse(text="mean(x)"), data.frame(x=1:4), enclos=NULL)
# [1] 2.5
but because var() is defined in stats, this does not
eval(parse(text="var(x)"), data.frame(x=1:4), enclos=NULL)
# Error in eval(expr, envir, enclos) : could not find function "var"
and lag(), like var() is defined in the stats package.

Predict function from Caret package give an Error

I am doing just a regular logistic regression using the caret package in R. I have a binomial response variable coded 1 or 0 that is called a SALES_FLAG and 140 numeric response variables that I used dummyVars function in R to transform to dummy variables.
data <- dummyVars(~., data = data_2, fullRank=TRUE,sep="_",levelsOnly = FALSE )
dummies<-(predict(data, data_2))
model_data<- as.data.frame(dummies)
This gives me a data frame to work with. All of the variables are numeric. Next I split into training and testing:
trainIndex <- createDataPartition(model_data$SALE_FLAG, p = .80,list = FALSE)
train <- model_data[ trainIndex,]
test <- model_data[-trainIndex,]
Time to train my model using the train function:
model <- train(SALE_FLAG~. data=train,method = "glm")
Everything runs nice and I get a model. But when I run the predict function it does not give me what I need:
predict(model, newdata =test,type="prob")
and I get an ERROR:
Error in dimnames(out)[[2]] <- modelFit$obsLevels :
length of 'dimnames' [2] not equal to array extent
On the other hand when I replace "prob" with "raw" for type inside of the predict function I get prediction but I need probabilities so I can code them into binary variable given my threshold.
Not sure why this happens. I did the same thing without using the caret package and it worked how it should:
model2 <- glm(SALE_FLAG ~ ., family = binomial(logit), data = train)
predict(model2, newdata =test, type="response")
I spend some time looking at this but not sure what is going on and it seems very weird to me. I have tried many variations of the train function meaning I didn't use the formula and used X and Y. I used method = 'bayesglm' as well to check and id gave me the same error. I hope someone can help me out. I don't need to use it since the train function to get what I need but caret package is a good package with lots of tools and I would like to be able to figure this out.
Show us str(train) and str(test). I suspect the outcome variable is numeric, which makes train think that you are doing regression. That should also be apparent from printing model. Make it a factor if you want to do classification.
Max

Predicting with lm object in R - black box paradigm

I have a function that returns an lm object. I want to produce predicted values based on some new data. The new data is a data.frame in the exact format as the data passed to the lm function, except that the response has been removed (since we're predicting, not training). I would expect to execute the following, but get an error:
predict( model , newdata )
"Error in eval(expr, envir, enclos) : object 'ModelResponse' not found"
In my case, ModelResponse was the name of the response column in the data I originally trained on. So just for kicks, I tried to insert NA reponse:
newdata$ModelResponse = NA
predict( model , newdata )
Error in terms.default(object, data = data) : no terms component nor attribute
Highly frustrating! R's notion of models/regression doesn't match mine: 1. I train a model with some data and get a model object. 2. I can score new data from any environment/function/frame/etc. so long as I input data into the model object that "looks like" the data I trained on (i.e. same column names). This is a standard black-box paradigm.
So here are my questions:
1. What concept(s) am I missing here?
2. How do I get my scenario to work?
3. How can I get model object to be portable? str(model) shows me that the model object saved the original data it trained on! So the model object is massive. I want my model to be portable to any function/environment/etc. and only contain the data it needs to score.
In the absence of str() on either the model or the data offered to the model, here's my guess regarding this error message:
predict( model , newdata )
"Error in eval(expr, envir, enclos) : object 'ModelResponse' not found"
I guess that you made a model object named "model" and that your outcome variable (the left-hand-side of the formula( in the original call to lm was named "ModelResponse" and that you then named a column in newdata by the same name. But what you should have done was leave out the "ModelResponse" columns (because that is what you are predicting) and put in the "Model_Predictor1", Model_Predictor2", etc. ... i.e. all the names on the right-hand-side of the formula given to lm()
The coef() function will allow you to extract the information needed to make the model portable.
mod.coef <- coef(model)
mod.coef
Since you expressed interest in the rms/Hmisc package combo Function, here it is using the help-example from ols and comparing the output with an extracted function and the rms Predict method. Note the capitals, since these are designed to work with the package equivalents of lm and glm(..., family="binomial") and coxph, which in rms become ols, lrm, and cph.
> set.seed(1)
> x1 <- runif(200)
> x2 <- sample(0:3, 200, TRUE)
> distance <- (x1 + x2/3 + rnorm(200))^2
> d <- datadist(x1,x2)
> options(datadist="d") # No d -> no summary, plot without giving all details
>
>
> f <- ols(sqrt(distance) ~ rcs(x1,4) + scored(x2), x=TRUE)
>
> Function(f)
function(x1 = 0.50549065,x2 = 1) {0.50497361+1.0737604* x1-
0.79398383*pmax(x1-0.083887788,0)^3+ 1.4392827*pmax(x1-0.38792825,0)^3-
0.38627901*pmax(x1-0.65115162,0)^3-0.25901986*pmax(x1-0.92736774,0)^3+
0.06374433*x2+ 0.60885222*(x2==2)+0.38971577*(x2==3) }
<environment: 0x11b4568e8>
> ols.fun <- Function(f)
> pred1 <- Predict(f, x1=1, x2=3)
> pred1
x1 x2 yhat lower upper
1 1 3 1.862754 1.386107 2.339401
Response variable (y): sqrt(distance)
Limits are 0.95 confidence limits
# The "yhat" is the same as one produces with the extracted function
> ols.fun(x1=1, x2=3)
[1] 1.862754
(I have learned through experience that the restricted cubic-spline fit functions coming from rms need to have spaces and carriage returns added to improve readability. )
Thinking long-term, you should probably take a look at the caret package. Many or most modeling functions work with data frames and matrices, others have a preference, and there may be other variations of their expectations. It's important to quickly get your head around each, but if you want a single wrapper that will simplify life for you, making the intricacies into a "black box", then caret is as close as you can get.
As a disclaimer: I do not use caret, as I don't think modeling should be a be a black box. I've had more than a few emails to maintainers of modeling packages resulting from looking into their code and seeing something amiss. Wrapping that in another layer would not serve my interests. So, in the very long-run, avoid caret and develop an enjoyment for dissecting what's going into and out of the different modeling functions. :)

Resources