I am trying to use MLflow in R. According to https://www.mlflow.org/docs/latest/models.html#r-function-crate, the crate flavor needs to be used for the model. My model uses the Random Forest function implemented in the ranger package:
model <- ranger::ranger(formula = model_formula,
data = trainset,
importance = "impurity",
probability=T,
num.trees = 500,
mtry = 10)
The model itself works and I can do the prediction on a testset:
test_prediction <- predict(model, testset)
As a next step, I try to bring the model in the crate flavor. I follow here the approach shown in https://docs.databricks.com/_static/notebooks/mlflow/mlflow-quick-start-r.html.
predictor <- crate(function(x) predict(model,.x))
This results however in an error, when I apply the "predictor" on the testset
predictor(testset)
Error in predict(model, .x) : could not find function "predict"
Does anyone know how to solve this issue? Do I have to pass the prediction function to crate differently? Any help is highly appreciated ;-)
In my experience, that Databricks quickstart guide is wrong.
According to the carrier package documentation, you need to use explicit namespaces when calling non-base functions inside crate(). Since predict is actually part of the stats package, you need to write stats::predict. Also, since your crated function depends on the global object named model, you need to pass that object as a named argument to crate() as well.
Your code would end up looking something like this (I can't test it on your exact use case, since I don't have your data, but this works for me on MLflow in Databricks):
model <- ranger::ranger(formula = model_formula,
data = trainset,
importance = "impurity",
probability=T,
num.trees = 500,
mtry = 10)
predictor <- crate(function(x) {
stats::predict(model,x)
}, model = model)
predictor(testset)
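If it helps to see why both changes matter, here is a minimal base-R sketch of the same idea. It does not use carrier at all: isolate() is a hypothetical helper I made up to mimic the isolation crate() gives you (a function that only sees base plus the objects you hand it), and an lm fit on mtcars stands in for the ranger model.

```r
# Small model on a built-in dataset (stand-in for the ranger model).
fit <- stats::lm(mpg ~ wt, data = mtcars)

# Hypothetical helper that mimics crate()-style isolation: the function can
# only see the base package plus the objects passed in explicitly.
isolate <- function(fn, ...) {
  env <- list2env(list(...), parent = baseenv())
  environment(fn) <- env
  fn
}

# Unqualified predict() lives in stats, which is not visible from baseenv(),
# so this call fails; the error message is captured instead of stopping.
broken <- isolate(function(x) predict(model, x), model = fit)
broken_result <- tryCatch(broken(mtcars), error = function(e) conditionMessage(e))

# Namespacing the call and passing the model makes the function self-contained.
working <- isolate(function(x) stats::predict(model, x), model = fit)
working_result <- working(mtcars)
```

The first version fails for the same reason as in the question (predict cannot be found), while the second one works because everything it needs is either namespaced or passed in.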
I'm trying to build a regression model in R using LightGBM,
and I'm getting a bit confused about some functions and when/how to use them.
The first question is the one in the title: what's the difference between lgb.train() and lightgbm()?
The documentation (https://cran.r-project.org/web/packages/lightgbm/lightgbm.pdf) says that lgb.train is 'Logic to train with LightGBM' and lightgbm is 'Simple interface for training a LightGBM model', while both return an lgb.Booster, a trained model.
One difference I've found is that lgb.train() does not work with valids = , while lightgbm() does.
The second question is about lgb.cv(), which performs cross-validation in LightGBM. How do you apply the output of lgb.cv() to a model?
As I understand from the documentation linked above, it seems like the output of both lgb.cv and lgb.train is a model.
Is it correct to use it like the example below?
lgbcv <- lgb.cv(params,
lgbtrain,
nrounds = 1000,
nfold = 5,
early_stopping_rounds = 100,
learning_rate = 1.0)
lgbcv <- lightgbm(params,
lgbtrain,
nrounds = 1000,
early_stopping_rounds = 100,
learning_rate = 1.0)
Thank you in advance!
what's the difference between lgb.train() and lightgbm()?
These functions both train a LightGBM model; they're just slightly different interfaces. The biggest difference is in how training data are prepared. LightGBM training requires a special LightGBM-specific representation of the training data, called a Dataset. To use lgb.train(), you have to construct one of these beforehand with lgb.Dataset(). lightgbm(), on the other hand, can accept a data frame, data.table, or matrix and will create the Dataset object for you.
Choose whichever method you feel has the friendlier interface. Both will produce a single trained LightGBM model (class "lgb.Booster").
that lgb.train() does not work with valids = , while lightgbm() does.
This is not correct. Both functions accept the keyword argument valids. Run ?lgb.train and ?lightgbm for documentation on those methods.
How do you apply the output of lgb.cv() to a model?
I'm not sure what you mean, but you can find an example of how to use lgb.cv() in the docs that show up when you run ?lgb.cv.
library(lightgbm)
# Built-in example data shipped with the package
data(agaricus.train, package = "lightgbm")
train <- agaricus.train
dtrain <- lgb.Dataset(train$data, label = train$label)
# Keep all training parameters in params rather than passing them as extra arguments
params <- list(
objective = "regression"
, metric = "l2"
, min_data = 1L
, learning_rate = 1.0
)
model <- lgb.cv(
params = params
, data = dtrain
, nrounds = 5L
, nfold = 3L
)
This returns an object of class "lgb.CVBooster". That object has multiple "lgb.Booster" objects in it (the trained models that lightgbm() or lgb.train() produce).
You can extract any one of these from model$boosters. However, in practice I don't recommend using the models from lgb.cv() directly. The goal of cross-validation is to get an estimate of the generalization error for a model. So you can use lgb.cv() to figure out the expected error for a given dataset + set of parameters (by looking at model$record_evals and model$best_score).
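To make that workflow concrete, here is a rough sketch in plain base R. It is not LightGBM code: lm on the built-in mtcars data stands in for the booster, and the fold count and formula are arbitrary choices of mine. The pattern is the point: cross-validation scores a recipe (data + parameters), and the model you actually use is then trained once on all of the data.

```r
set.seed(42)
k <- 5L
# Randomly assign each row of mtcars to one of k folds.
fold_id <- sample(rep(seq_len(k), length.out = nrow(mtcars)))

# For each fold: fit on the other folds, score on the held-out fold.
fold_mse <- vapply(seq_len(k), function(i) {
  train_rows <- mtcars[fold_id != i, ]
  test_rows <- mtcars[fold_id == i, ]
  fit <- stats::lm(mpg ~ wt + hp, data = train_rows)
  mean((test_rows$mpg - stats::predict(fit, test_rows))^2)
}, numeric(1))

# The CV estimate describes the modelling recipe, not any single fold model.
cv_mse <- mean(fold_mse)

# The model used for prediction is refit once on all rows.
final_model <- stats::lm(mpg ~ wt + hp, data = mtcars)
```

With LightGBM the equivalent would be to compare parameter sets using lgb.cv() and then call lgb.train() or lightgbm() once with the chosen parameters.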
This is somewhat similar to the question I asked here. However, that question has zero answers and I think this question might be more fruitful in getting a response.
What I am trying to do is remove some features from an mlr created model, without having to fit the model again. For example, if we take the Boston data from the MASS library and create an mlr model, like so:
library(mlr)
library(MASS)
# Using the mlr package to train the data:
bTask <- makeRegrTask(data = Boston, target = "medv")
bLearn <- makeLearner("regr.randomForest")
bMod <- train(bLearn, bTask)
And then I use the task and trained model in some function, for example:
someFunc <- function(task, model){
pred <- predict(model, task)
pred <- pred$data$response
head(pred,10)
}
someFunc(bTask,bMod)
Everything works fine. But I'm wondering if it's possible to remove some variables from bMod, without having to fit the mlr trained model again?
I know it's possible to drop features from the task using dropFeatures(), for example:
bTask1 <- dropFeatures(bTask, c("zn", "chas", "rad"))
But if I try to mix bTask1 and bMod like so:
pred1 <- predict(bMod, bTask1)
I get the sensible error:
Error in predict.randomForest(.model$learner.model, newdata =
.newdata, : variables in the training data missing in newdata
Is there a way of dropping some features from the mlr created model (i.e, bMod) without fitting it again?
The question is more or less as the title indicates. I would like to use the caret::train function with beta-binomial models made with glmmTMB package (although I am not opposed to other functions capable of fitting beta-binomial models) to calculate median absolute error (MdAE) estimates through jack-knife (leave-one-out) cross-validation. The glmmTMBControl function is already capable of estimating the optimal dispersion parameter but I was hoping to retain this information somehow as well... or having caret do the calculation possibly?
The dataset I am working with looks like this:
df <- data.frame(Effect = rep(seq(from = 0.05, to = 1, by = 0.05), each = 5), Time = rep(seq(1:20), each = 5))
Ideally I would be able to pass the glmmTMB function to trainControl like so:
BB.glmm1 <- train(Time ~ Effect,
data = df, method = "glmmTMB",
method = "", metric = "MAD")
The output would be as per the examples contained in train, although possibly with estimates for the dispersion parameter.
Although I am in no way opposed to work arounds - Thank you in advance!
I am unsure how to perform the required operation with caret without creating a custom method but I trust it is fairly easy to implement it with a for (lapply) loop.
In the example I will use the sleepstudy data set since your example data throws a bunch of warnings.
library(glmmTMB)
To perform LOOCV, for every row, fit a model without that row and predict on that row:
data(sleepstudy,package="lme4")
LOOCV <- lapply(1:nrow(sleepstudy), function(x){
m1 <- glmmTMB(Reaction ~ Days + (Days|Subject),
data = sleepstudy[-x,])
return(predict(m1, sleepstudy[x,], type = "response"))
})
Get the median of the absolute residuals (I think this is MdAE? If not, post a comment on how it's calculated):
median(abs(unlist(LOOCV) - sleepstudy$Reaction))
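For reference, this is the definition I am assuming for MdAE, shown on a few made-up numbers (mdae() is just a hypothetical helper name, not something from caret or glmmTMB). The mean absolute error is included only to show that the two are not the same thing.

```r
# Hypothetical helper: median absolute error between observed and predicted values.
mdae <- function(actual, predicted) {
  stats::median(abs(actual - predicted))
}

# Made-up numbers purely for illustration.
actual <- c(10, 20, 30, 40, 50)
predicted <- c(12, 19, 33, 40, 45)

abs_errors <- abs(actual - predicted)  # 2 1 3 0 5
mdae_value <- mdae(actual, predicted)  # median of the absolute errors: 2
mae_value <- mean(abs_errors)          # mean of the absolute errors: 2.2
```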
Using RFE, you can get an importance ranking of the features, but right now I can only use the models and parameters built into the package, such as lmFuncs (linear model) and rfFuncs (random forest).
It seems that caretFuncs allows custom settings for your own model and parameters, but I don't know the details and the official documentation doesn't explain them. I want to apply svm and gbm in this RFE process, because these are the models I currently use for training. Does anyone have any idea?
I tried to recreate a working example based on the documentation. You correctly identified caretFuncs as the thing to use; you can then set your model parameters in the rfe call (you can also define a trainControl object, etc.).
# load caret
library(caret)
# load data, get target and feature column labels
data(iris)
col_names = names(iris)
target = "Species"
feature_names = col_names[col_names!=target]
# construct rfeControl object
rfe_control = rfeControl(functions = caretFuncs, #caretFuncs here
method="cv",
number=5)
# construct trainControl object for your train method
fit_control = trainControl(classProbs=T,
search="random")
# get results
rfe_fit = rfe(iris[,feature_names], iris[,target],
sizes = 1:4,
rfeControl = rfe_control,
method="svmLinear",
# additional arguments to train method here
trControl=fit_control)
If you want to dive deeper into the matter you might want to visit links below.
rfe documentation with basic code snippets:
https://www.rdocumentation.org/packages/caret/versions/6.0-80/topics/rfe
caret documentation on rfe:
https://topepo.github.io/caret/recursive-feature-elimination.html
Hope this helps!
I am trying to perform classification using Support Vector Machines in R using e1071 package. Using the following code, and specifying the cost and gamma parameters, I could train the models successfully.
svm_models <- lapply(training_data,
function(data)
{
svm(label~., data=data,
method="C-classification", kernel="radial",
cost=10, gamma=0.1)
})
But if I perform parameter tuning within the above function, as in the following code,
svm_models <- lapply(training_data,
function(data)
{
params <- tune.svm(label~., data=data,
gamma=10^(-6:-2), cost=10^(1:2))
svm(label~., data=data,
method="C-classification", kernel="radial",
cost=params$best.parameter[[2]], gamma=params$best.parameter[[1]])
})
then I get the following error:
Error in predict.svm(ret, xhold, decision.values = TRUE) (from #4) :
Model is empty!
What could be the possible cause of this issue?
Thanks.
According to ?tune, it should be best.parameters, not best.parameter. Try adding the 's' at the end of both instances in your code, and see if it works.
It is very difficult to say anything definitive with no data for testing (or even a description of the data). However, your call to svm after tune.svm is not in keeping with the example on the e1071::tune help page. Furthermore, the formal parameter through which the "cost" and "gamma" values should be given as list elements is "ranges". You should not need to run svm on the output.
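Since I don't have your data, here is a rough sketch of that pattern on the built-in iris data. The grid values and the seed are arbitrary choices of mine, and this only illustrates the tune() interface; it does not diagnose the "Model is empty!" error.

```r
library(e1071)

set.seed(1)
# tune() takes the candidate values as a named list via `ranges`;
# extra arguments such as kernel are passed on to svm().
tuned <- tune(
  svm, Species ~ ., data = iris,
  kernel = "radial",
  ranges = list(gamma = 10^(-3:-1), cost = 10^(0:2))
)

# Winning hyperparameters (a one-row data frame) and the model refit with them.
best_params <- tuned$best.parameters
best_svm <- tuned$best.model

# The tuned model can be used directly, without another call to svm().
predicted <- predict(best_svm, iris)
```

Inside your lapply() you could return tuned$best.model for each data set instead of calling svm() a second time.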