Model Error in Using R Caret with numeric data

I am new to R, and I got an error while using caret.
# load the libraries
library(mlbench)
library(caret)
# load the dataset (a subset of my data)
mydata2 <- mydata[1:200, c(52, 56:59)]
mydata2
# prepare training scheme
control <- trainControl(method="lm", number=10, repeats=3)
# train the model
model <- train(MtrRegActNetEngyDailyKwh ~ ., data = mydata2, method = "lvq", preProcess = "scale", trControl = control)
# estimate variable importance
importance <- varImp(model, scale=FALSE)
# summarize importance
print(importance)
But the result shows nothing, and the plot shows nothing.
Example of my data:
structure(list(MtrRegActNetEngyDailyKwh = c(16.736, 18.093),
Building = c(6, 6), numberofpeople = c(5, 5), pool = c(2,
2), typeofAC = c(1, 1)), row.names = 1:2, class = "data.frame")
I am not sure why the model does not work. Can I get some help?
Update:
I tried the following code, and it works:
model_nnet <- train(trainSetSmall[, predictors], trainSetSmall[, outcomeName], method = 'nnet')
importance <- varImp(model_nnet, scale=FALSE)
plot(importance)
I also want to test it with the 'gbm' model:
model_gbm <- train(trainSetSmall[, predictors], trainSetSmall[, outcomeName], method = 'gbm')
importance2 <- varImp(model_gbm, scale=FALSE)
But I got an error message:
> importance2 <- varImp(model_gbm, scale=FALSE)
Error in relative.influence(object, n.trees = numTrees) :
  could not find function "relative.influence"
I am not sure why this does not work either; I just want to test another model. Can I get some help?

As the error states, you are using the wrong kind of model for your data: Learning Vector Quantization is a classification method only, not a regression method, so it cannot model your numeric outcome. You need to select a different model given your data. See this page of the caret documentation for all the available models in caret, and filter on regression to see all the regression models.
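For instance, a regression-capable method works with a numeric outcome. A minimal sketch, assuming mydata2 is built as in the question (note also that trainControl expects a resampling method such as "repeatedcv", not "lm"):

library(caret)
# resampling scheme: 10-fold CV repeated 3 times
control <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
# "rf" (random forest) supports regression, unlike "lvq"; any method tagged
# as a regression model in the caret model list would work here
model <- train(MtrRegActNetEngyDailyKwh ~ ., data = mydata2,
               method = "rf", preProcess = "scale", trControl = control)
# variable importance now works as intended
importance <- varImp(model, scale = FALSE)
print(importance)
plot(importance)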

Related

unable to train the randomforest model in R

I'm trying to train a model on my dataset in R. Following is the code I'll be using:
functionRankFeatureByImportance <- function(logwine_withoutQuality){
  # logwine_withoutQuality$quality <- factor(logwine_withoutQuality$quality)
  # ensure results are repeatable
  set.seed(7)
  # prepare training scheme
  control <- trainControl(method="repeatedcv", number=10, repeats=3)
  # train the model
  model <- train(logwine_withoutQuality[,-12],
                 logwine_withoutQuality$quality,
                 method="lvq", preProcess="scale", trControl=control)
  # estimate variable importance
  importance <- varImp(model, scale=FALSE)
  # summarize importance
  print(importance)
  # plot importance
  plot(importance)
}
But when using this I'm getting an error like the one below.
I'm unable to understand what my error is.
Following is an image of the dataset I'm using.
There aren't any null values in the dataframe.
I'd really appreciate it if someone could kindly help me solve this.

Difference between Model and $FinalModel for classification in R?

I currently have this random forest model, and I am just seeing how well it predicts whether someone is diabetes positive or diabetes negative.
The model is fitted using the caret workflow.
When looking at variable importance, I was told to use the code
randomForest::importance(model$finalModel)
What is the purpose of $finalModel? What is $finalModel as compared to just the original model? Should it not just be the original model passed in as the argument instead to view variable importance?
example below:
library(tidyverse)
library(mlbench)
library(caret)
library(car)
library(glmnet)
library(rpart.plot)
library(rpart)
data("PimaIndiansDiabetes2")
PimaIndiansDiabetes2 <- na.omit(PimaIndiansDiabetes2)
set.seed(123)
training.samples <- PimaIndiansDiabetes2$diabetes %>% createDataPartition(p = 0.8, list = FALSE)
train.data <- PimaIndiansDiabetes2[training.samples,]
test.data <- PimaIndiansDiabetes2[-training.samples,]
model_rf <- caret::train(
  diabetes ~ .,
  data = train.data,
  method = "rf",
  trControl = trainControl("cv", number = 10),
  importance = TRUE)
model_rf
model_rf$bestTune
model_rf$finalModel
# variable importance here
importance(model_rf$finalModel)
From the documentation:
finalModel A fit object using the best parameters
Most of the time with train you pass several candidate values for hyper-parameter tuning, to find the values that achieve the best performance (resampled as specified by trainControl).
Inside model_rf, under finalModel, you'll find the model rebuilt with those best parameters.
FYI caret also has a function for variable importance plotting: varImp.
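To make the distinction concrete, a minimal sketch reusing model_rf from the question:

# finalModel is the underlying randomForest fit (refit with the best mtry),
# which is what randomForest::importance() expects
randomForest::importance(model_rf$finalModel)
# passing the train wrapper itself would fail: it is not a randomForest object.
# caret's own generic accepts the train object directly and can be plotted:
vi <- varImp(model_rf)
plot(vi)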

Does predict.H2OModel() from h2o package in R give OOB predictions for h2o.randomForest() models?

I can't tell from the documentation whether or not the predict.H2OModel() function from the h2o package in R gives OOB predictions for random forest models built using h2o.randomForest().
In fact, in the 3-4 examples I've tried, it seems the results of predict.H2OModel() are closer to the non-OOB predictions from predict.randomForest() from the randomForest package than the OOB ones.
Does anyone know if they are OOB predictions? If not, do you know how to get OOB predictions for h2o.randomForest() models?
Example:
set.seed(123)
library(randomForest)
library(h2o)
data(mtcars)
d = mtcars[,c('mpg', 'cyl', 'disp', 'hp', 'wt' )]
## define some common settings for both random forests
n.trees=1000
mtry = 3
min.node = 3
## prep for h2o.randomForest
h2o.init()
d.h2o= as.h2o(d)
x.names = colnames(d)[2:5] ## predictors
## fit both models
set.seed(123)
rf = randomForest(mpg ~ ., data = d , ntree=n.trees, mtry = mtry, nodesize=min.node)
h2o = h2o.randomForest(y='mpg', x=x.names, training_frame = d.h2o, ntrees=n.trees, mtries = mtry, min_rows=min.node)
## Correct way and incorrect way of getting OOB predictions for a randomForest model. Not sure about h2o model.
d$rf.oob.pred = predict(rf) ## Gives OOB predictions
d$rf.pred = predict(rf , newdata=d ) ## Doesn't give OOB predictions.
d$h2o.pred = as.vector(predict(h2o, newdata=d.h2o)) ## Not sure if this is OOB or not.
## d$h2o.pred seems more similar to d$rf.pred than d$rf.oob.pred,
## suggesting that predict.H2OModel() might not give OOB predictions.
mean((d$rf.pred - d$h2o.pred)^2)
mean((d$rf.oob.pred - d$h2o.pred)^2)
H2O's h2o.predict() does not provide predictions for the OOB data. You have to specify which dataset you want to predict on with the newdata = parameter, so when you pass newdata = d.h2o you are getting predictions for the d.h2o dataframe you've specified.
Currently there is no method to get predictions for the OOB data. However, there is a JIRA ticket to specify whether you would like OOB metrics (note this ticket also links to another ticket which helps clarify how training metrics are currently reported for random forest).
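As a workaround, cross-validation holdout predictions are not OOB but are similarly held out. A sketch building on the example above (nfolds = 5 is an arbitrary choice):

# refit with cross-validation and keep the holdout predictions; each row's
# prediction then comes from a fold model that did not see that row
h2o_cv <- h2o.randomForest(y = 'mpg', x = x.names, training_frame = d.h2o,
                           ntrees = n.trees, mtries = mtry, min_rows = min.node,
                           nfolds = 5, keep_cross_validation_predictions = TRUE)
d$h2o.cv.pred <- as.vector(h2o.cross_validation_holdout_predictions(h2o_cv))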

Pooled Regression Results using mice, caret, and glmnet

Not sure if this is more of a statistics question, but the closest similar problem I could find is here, although I couldn't get it to work for my case.
I am trying to develop a pooled, penalized logistic regression model. I used mice to create a mids object and then fit a model to each dataset using caret repeated cross-validation with elastic net regression (glmnet) to tune parameters. The fitted object is not of class "mira" but I think I fixed that by changing the object class with the right list items. The major issue is that glmnet does not have an associated vcov method, which is required by pool().
I would like to use penalized regression given the number of variables and my uncertainty over which ones are the best predictors. My data consist of 4 numeric variables and 9 categorical variables of varying levels, and I anticipate including interactions.
Does anyone know how I might be able to create my own vcov method or otherwise address this issue? I am not sure if this is possible.
Example data and code are enclosed, noting that I am not able to share the actual data.
library(mice)
library(caret)
dat <- as.data.frame(list(
  time = c(4,3,1,1,2,2,3,5,2,4,5,1,4,3,1,1,2,2,3,5,2,4,5,1),
  status = c(1,1,1,0,2,2,0,0,NA,1,2,0,1,1,1,NA,2,2,0,0,1,NA,2,0),
  x = c(0,2,1,1,NA,NA,0,1,1,2,0,1,0,2,1,1,NA,NA,0,1,1,2,0,1),
  sex = c("M","M","M","M","F","F","F","F","M","F","F","M","F","M","M","M","F","F","M","F","M","F","M","F")))
imp <- mice(dat, m = 5, seed = 192)
control <- trainControl(method = "repeatedcv",
                        number = 10,
                        repeats = 3,
                        verboseIter = FALSE)
mod <- list(analyses = vector("list", imp$m))
for(i in 1:imp$m){
  mod$analyses[[i]] <- train(sex ~ .,
                             data = complete(imp, i),
                             method = "glmnet",
                             family = "binomial",
                             trControl = control,
                             tuneLength = 10,
                             metric = "Kappa")
}
obj <- as.mira(mod)
obj <- list(call = mod$analyses[[1]]$call, call1 = imp$call,
            nmis = imp$nmis, analyses = mod$analyses)
oldClass(obj) <- "mira"
pool(obj)
Produces:
Error in pool(obj) : Object has no vcov() method.

Calculate Prediction Intervals of a predicted value using Caret package of R

I used different neural network packages within the caret package for my predictions. The code used with the nnet package is:
library(caret)
# training model using nnet method
data <- na.omit(data)
xtrain <- data[,c("temperature","prevday1","prevday2","prev_instant1","prev_instant2","prev_2_hour")]
ytrain <- data$power
train_model <- train(x = xtrain, y = ytrain, method = "nnet",
                     linout = TRUE, na.action = na.exclude, trace = FALSE)
# prediction using the trained model
pred_ob <- predict(train_model, newdata = dframe, type = "raw")
The predict function simply calculates the predicted value, but I also need prediction intervals (2-sigma) as well. On searching, I found a relevant answer at a stackoverflow link, but it does not produce what I need. The solution suggests using the finalModel variable as:
predict(train_model$finalModel, newdata = dframe, interval = "confidence", type = "raw")
Is there any other way to calculate prediction intervals? The training data used is the dput() of my previous question at a stackoverflow link, and the dput() of my prediction dataframe (test data) is:
dframe <- structure(list(temperature = 27, prevday1 = 1607.69296666667,
prevday2 = 1766.18103333333, prev_instant1 = 1717.19306666667,
prev_instant2 = 1577.168915, prev_2_hour = 1370.14983583333), .Names = c("temperature",
"prevday1", "prevday2", "prev_instant1", "prev_instant2", "prev_2_hour"
), class = "data.frame", row.names = c(NA, -1L))
Update:
I used the nnetpredint package as suggested at the link. To my surprise it results in an error, which I find difficult to debug. Here is my updated code so far:
library(nnetpredint)
nnetPredInt(train_model, xTrain = xtrain, yTrain = ytrain,newData = dframe)
It results in the following error:
Error: Number of observations for xTrain, yTrain, yFit are not the same
[1] 0
I can check that xtrain, ytrain and dframe have the correct dimensions, but I have no idea about yFit; according to the examples in the nnetpredint vignette, I shouldn't need it.
caret doesn't generate prediction intervals; that relies on the individual package. If that package cannot do it, then neither can the train object. I agree that nnetPredInt is the appropriate way to go.
Two other notes:
- you most likely should center and scale your data if you have not already (a sketch follows below).
- using the finalModel object is somewhat dangerous, since it has no idea what was done to the data (e.g. dummy variables, centering and scaling, or other preprocessing methods) before it was created.
Max
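For the first note, a minimal sketch of letting train do the centering and scaling; predict() on the train object then re-applies the same preprocessing to new data, which is exactly what is lost when calling finalModel directly:

# preProcess is applied inside resampling and again at predict time
train_model <- train(x = xtrain, y = ytrain, method = "nnet",
                     preProcess = c("center", "scale"),
                     linout = TRUE, na.action = na.exclude, trace = FALSE)
pred_ob <- predict(train_model, newdata = dframe)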
Thanks for your question. The simple answer to your problem is this: right now the nnetPredInt function only supports the following S3 objects, "nnet", "nn" and "rsnns", produced by different neural network packages, while the train function in the caret package returns a "train" object. That's why nnetPredInt doesn't get the yFit vector, the fitted values of the training dataset, from your train_model.
1. Quick way to use the model from the caret package:
Get the finalModel result from the 'train' object:
nnetObj = train_model$finalModel # the 'nnet' model which the caret package has found
yPredInt = nnetPredInt(nnetObj, xTrain = xtrain, yTrain = ytrain, newData = dframe)
For example, use the iris dataset and the 'nnet' method from the caret package for regression prediction:
library(caret)
library(nnetpredint)
# Setosa 0 and Versicolor 1
ird <- data.frame(rbind(iris3[,,1], iris3[,,2]), species = c(rep(0, 50), rep(1, 50)))
samp = sample(1:100, 80)
xtrain = ird[samp,][1:4]
ytrain = ird[samp,]$species
# Training
train_model <- train(x = xtrain, y = ytrain, method = "nnet", linout = FALSE, na.action = na.exclude, trace = FALSE)
class(train_model) # [1] "train"
nnetObj = train_model$finalModel
class(nnetObj) # [1] "nnet.formula" "nnet"
# Constructing Prediction Interval
xtest = ird[-samp,][1:4]
ytest = ird[-samp,]$species
yPredInt = nnetPredInt(nnetObj, xTrain = xtrain, yTrain = ytrain, newData = xtest)
# Compare Results: ytest and yPredInt
ytest
yPredInt
2. The hard way:
Use the generic nnetPredInt function to pass all the neural-net-specific parameters to the function:
nnetPredInt(object = NULL, xTrain, yTrain, yFit, node, wts, newData,alpha = 0.05 , lambda = 0.5, funName = 'sigmoid', ...)
xTrain # Training Dataset
yTrain # Training Target Value
yFit # Fitted Value of the training data
node # Structure of your network, like c(4,5,5,1)
wts # Specific order of weights parameters found by your neural network
newData # New Data for prediction
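A sketch of pulling those pieces from a fitted 'nnet' object, assuming a single hidden layer (field names as stored by the nnet package):

nnetObj <- train_model$finalModel          # the underlying 'nnet' fit
yFit <- as.vector(nnetObj$fitted.values)   # fitted values on the training data
node <- nnetObj$n                          # layer sizes, e.g. c(6, 5, 1)
wts  <- nnetObj$wts                        # weights in nnet's internal ordering
nnetPredInt(xTrain = xtrain, yTrain = ytrain, yFit = yFit,
            node = node, wts = wts, newData = dframe)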
Tips:
Right now the nnetpredint package only supports standard multilayer neural network regression with an activated output, not a linear output.
It will support more types of models in the future.
You can use the nnetPredInt function from the nnetpredint package; check out the function's help page.
If you are open to writing your own implementation, there is another option: you can get prediction intervals from a trained net using the same implementation you would write for standard non-linear regression (assuming back-propagation was used for the estimation).
This paper goes through the methodology and is fairly straightforward: http://www.cis.upenn.edu/~ungar/Datamining/Publications/yale.pdf
There are, as with everything, some cons to this approach (outlined in the paper), but it is definitely worth knowing as an option.
