How to drop variables from a model created from the mlr package in R?

This is somewhat similar to the question I asked here. However, that question has zero answers, and I think this question might be more likely to get a response.
What I am trying to do is remove some features from an mlr created model, without having to fit the model again. For example, if we take the Boston data from the MASS library and create an mlr model, like so:
library(mlr)
library(MASS)
# Using the mlr package to train the data:
bTask <- makeRegrTask(data = Boston, target = "medv")
bLearn <- makeLearner("regr.randomForest")
bMod <- train(bLearn, bTask)
And then I use the task and trained model in some function, for example:
someFunc <- function(task, model){
  pred <- predict(model, task)
  pred <- pred$data$response
  head(pred, 10)
}
someFunc(bTask,bMod)
Everything works fine. But I'm wondering if it's possible to remove some variables from bMod without having to fit the mlr trained model again?
I know it's possible to drop features from the task using dropFeatures(), for example:
bTask1 <- dropFeatures(bTask, c("zn", "chas", "rad"))
But if I try to mix bTask1 and bMod like so:
pred1 <- predict(bMod, bTask1)
I get the sensible error:
Error in predict.randomForest(.model$learner.model, newdata = .newdata, :
  variables in the training data missing in newdata
Is there a way of dropping some features from the mlr created model (i.e., bMod) without fitting it again?
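For completeness, the refit-based fallback I am trying to avoid looks like this:
# retrain the same learner on the reduced task (this is the refit I want to skip)
bMod1 <- train(bLearn, bTask1)
someFunc(bTask1, bMod1)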

Related

Problem crating a Ranger model with R to use for MLflow

I am trying to use MLflow in R. According to https://www.mlflow.org/docs/latest/models.html#r-function-crate, the crate flavor needs to be used for the model. My model uses the Random Forest function implemented in the ranger package:
model <- ranger::ranger(formula = model_formula,
                        data = trainset,
                        importance = "impurity",
                        probability = TRUE,
                        num.trees = 500,
                        mtry = 10)
The model itself works and I can do the prediction on a testset:
test_prediction <- predict(model, testset)
As a next step, I try to bring the model in the crate flavor. I follow here the approach shown in https://docs.databricks.com/_static/notebooks/mlflow/mlflow-quick-start-r.html.
predictor <- crate(function(x) predict(model,.x))
However, this results in an error when I apply predictor to the testset:
predictor(testset)
Error in predict(model, .x) : could not find function "predict"
Does anyone know how to solve this issue? Do I have to pass the prediction function differently to the crate function? Any help is highly appreciated ;-)
In my experience, that Databricks quickstart guide is wrong.
According to the carrier documentation, you need to use explicit namespaces when calling non-base functions inside crate. Since predict actually comes from the stats package, you'd need to write stats::predict. Also, since your crated function depends on the global object named model, you need to pass that as an argument to crate as well.
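A minimal illustration of both points (shift is a made-up object standing in for your model):
library(carrier)
shift <- 10
# crate() strips away the global environment, so any captured object must be
# passed in explicitly, and non-base functions need their namespace prefix
f <- crate(function(x) stats::na.omit(x) + shift, shift = shift)
f(c(1, NA, 2))  # returns 11 12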
Your code would end up looking something like this (I can't test it on your exact use case, since I don't have your data, but this works for me on MLflow in Databricks):
model <- ranger::ranger(formula = model_formula,
                        data = trainset,
                        importance = "impurity",
                        probability = TRUE,
                        num.trees = 500,
                        mtry = 10)
predictor <- crate(function(x) {
  stats::predict(model, x)
}, model = model)
predictor(testset)

How to get the confusion matrix with the statistics?

I have used a decision tree to predict my test set. After running my code I get a table with the results, but I want to use the confusionMatrix() command from the caret library. I have tried several things, but none have worked. Please see my code:
library(rpart)
tree <- rpart(number ~ ., data = train, method = "class")
pred <- predict(tree, test, type = "class")
p <- predict(tree, type = "class")
# Confusion Matrix
conf <- table(test$number, pred)
> conf
           pred
            Problem Reference
  Problem         0       100
  Reference       0      2782
I tried to do this:
p <- predict(tree, type="class")
confusionMatrix(p, entiredata$number)
I got errors like "data and reference should be the same type", so I converted both to factors with as.factor(), but then the arguments were not the same length. I searched the web and found similar questions, but none of them helped me. My final goal is to get the statistics, such as the accuracy.
library(caret)
confusionMatrix(pred, test$number)
Since you predict only on the test data, you should compare predictions only on the test data, not the whole dataset.
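If the two arguments still disagree in type after that, a hedged sketch of the usual cleanup (pred from rpart is already a factor; the reference may need converting so that the levels match):
library(caret)
pred <- predict(tree, test, type = "class")
# both arguments must be factors of the same length with identical levels;
# confusionMatrix() then prints the table plus Accuracy, Kappa, sensitivity, etc.
confusionMatrix(pred, factor(test$number, levels = levels(pred)))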

Understanding how to use nnet in R

This is my first attempt at using a machine learning paradigm in R. I'm using a planet data set (URL: https://www.kaggle.com/mrisdal/open-exoplanet-catalogue) and I simply want to predict a planet's size based on the size of its sun. This is the code I currently have, using nnet():
library(nnet)
#Organize data:
cols_to_keep = c(1,4,21)
full_data <- na.omit(read.csv('Planet_Data.csv')[, cols_to_keep])
#Split data:
train_data <- full_data[sample(nrow(full_data), round(nrow(full_data)/2)), ]
# take the complement split before resetting any rownames
test_data <- full_data[!rownames(full_data) %in% rownames(train_data), ]
rownames(train_data) <- 1:nrow(train_data)
rownames(test_data) <- 1:nrow(test_data)
#nnet
nnet_attempt <- nnet(RadiusJpt ~ HostStarRadiusSlrRad, data = train_data, size = 0,
                     linout = TRUE, skip = TRUE, MaxNWts = 10000, trace = FALSE,
                     maxit = 1000, decay = 0.001)
nnet_newdata <- predict(nnet_attempt, newdata=test_data)
nnet_newdata
When I print nnet_newdata I get a value for each row in my data, but I don't really understand what these values mean. Is this a proper way to use nnet() for a simple regression?
Thanks
When predict is called on an object of class nnet you get, by default, the raw output of the nnet model applied to your new dataset. If instead yours is a classification problem, you can use type = "class".
See ?predict.nnet for details.
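For instance, reusing the objects above (the last line is hypothetical, since this model is a regression, not a classifier):
# Regression: the raw output is the predicted RadiusJpt itself (linout = TRUE above)
head(predict(nnet_attempt, newdata = test_data, type = "raw"))

# A net fit on a factor target would instead return labels with:
# predict(nnet_classifier, newdata = test_data, type = "class")  # hypothetical fit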

Predict function from the caret package gives an error

I am doing just a regular logistic regression using the caret package in R. I have a binomial response variable coded 1 or 0 called SALE_FLAG, and 140 predictor variables that I used the dummyVars function in R to transform into dummy variables.
data <- dummyVars(~ ., data = data_2, fullRank = TRUE, sep = "_", levelsOnly = FALSE)
dummies <- predict(data, data_2)
model_data <- as.data.frame(dummies)
This gives me a data frame to work with. All of the variables are numeric. Next I split into training and testing:
trainIndex <- createDataPartition(model_data$SALE_FLAG, p = 0.80, list = FALSE)
train <- model_data[ trainIndex, ]
test  <- model_data[-trainIndex, ]
Time to train my model using the train function:
model <- train(SALE_FLAG ~ ., data = train, method = "glm")
Everything runs nicely and I get a model. But when I run the predict function it does not give me what I need:
predict(model, newdata = test, type = "prob")
and I get an ERROR:
Error in dimnames(out)[[2]] <- modelFit$obsLevels :
  length of 'dimnames' [2] not equal to array extent
On the other hand, when I replace "prob" with "raw" as the type inside the predict function, I get predictions, but I need probabilities so I can convert them into a binary variable given my threshold.
Not sure why this happens. I did the same thing without using the caret package and it worked how it should:
model2 <- glm(SALE_FLAG ~ ., family = binomial(logit), data = train)
predict(model2, newdata = test, type = "response")
I spent some time looking at this, but I'm not sure what is going on and it seems very weird to me. I have tried many variations of the train function, meaning I didn't use the formula interface and used x and y instead. I used method = 'bayesglm' as well to check, and it gave me the same error. I hope someone can help me out. I don't strictly need the train function to get what I need, but caret is a good package with lots of tools and I would like to be able to figure this out.
Show us str(train) and str(test). I suspect the outcome variable is numeric, which makes train think that you are doing regression. That should also be apparent from printing model. Make it a factor if you want to do classification.
Max
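A minimal sketch of that fix, assuming SALE_FLAG is coded 0/1 (caret also wants class levels that are syntactically valid R names when it computes probabilities, hence the relabeling):
# convert the 0/1 outcome to a factor so train() treats this as classification
train$SALE_FLAG <- factor(train$SALE_FLAG, levels = c(0, 1), labels = c("no", "yes"))
test$SALE_FLAG  <- factor(test$SALE_FLAG,  levels = c(0, 1), labels = c("no", "yes"))

model <- train(SALE_FLAG ~ ., data = train, method = "glm")
predict(model, newdata = test, type = "prob")  # now returns per-class probabilities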

Caret - Some PreProcessing Options Not Available in Train

In caret::train there are many pre-processing options that can be passed via the preProcess argument. This makes life super-simple because the test data is then auto-magically pre-processed in the same manner as the training data when calling predict.train. Is it possible to do the same with findCorrelation and nearZeroVar in some manner?
I clearly understand from the documentation why the following code does not work, but I am hoping this clarifies my question. Ideally, I could do the following.
library("caret")
set.seed (1234)
data (iris)
# split test vs training
train.index <- createDataPartition (y = iris[,5], p = 0.80, list = F)
train <- iris [ train.index, ]
test <- iris [-train.index, ]
# train the model after imputing the missing data
fit <- train (Species ~ .,
train,
preProcess = c("findCorrelation", "nearZeroVar"),
method = "rpart" )
predict (fit, test)
Right now, you are tied to whatever preProcess will do.
However, the next version (around the start of the year, I hope) will allow you to more easily write custom models and pre-processing. For example, you might want to down-sample the data etc.
Let me know if you would like to test that version when we have a beta available.
Max
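Until then, a hand-rolled workaround sketch (the filters are computed on the training predictors only, and the same column subset is reused at predict time; column indices assume the iris layout above):
x_train <- train[, -5]  # predictors only; Species is column 5

nzv      <- nearZeroVar(x_train)                          # near-zero-variance columns
high_cor <- findCorrelation(cor(x_train), cutoff = 0.9)   # highly correlated columns
keep     <- setdiff(seq_len(ncol(x_train)), unique(c(nzv, high_cor)))

fit <- train(x = x_train[, keep, drop = FALSE], y = train$Species, method = "rpart")
predict(fit, test[, keep, drop = FALSE])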