How to save a glmnet model to a file in R?

When I am using R, how can I save a model built by glmnet to a file, and then read it back from the file so that I can use it for prediction?
Does the same approach work if I use cv.glmnet to build the model?
Thanks!

Maybe I misunderstand your point, but it is always possible to use the save() function to save your R object to an .RData file. Next time, you simply use load("YourFile.RData") to load the object(s) back into the session.

library(glmnet)
library(ISLR)
# Data and model
auto = ISLR::Auto
mod = cv.glmnet(as.matrix(auto[1:300,2:6]), as.matrix(auto[1:300,1]), type.measure = "mse", nfolds = 5)
predict(mod, newx = as.matrix(auto[300:392,2:6]), s = "lambda.min")
# Save model
save(mod, file="D:/mymodel.RData")
rm(mod)
# Reload model
load("D:/mymodel.RData")
predict(mod, newx = as.matrix(auto[300:392,2:6]), s = "lambda.min")
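As a side note (not part of the original answer): saveRDS()/readRDS() is a common alternative when you only need to persist a single object. Unlike load(), readRDS() returns the object, so you can bind it to any name on reload; the path below is just illustrative.
# Alternative: saveRDS() stores a single object; readRDS() returns it
saveRDS(mod, file = "D:/mymodel.rds")
mod2 <- readRDS("D:/mymodel.rds")
predict(mod2, newx = as.matrix(auto[300:392, 2:6]), s = "lambda.min")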

Related

readRDS error from reading an R object of ~160 MB

I'm trying to save an R object: a ridge regression model built with the R package glmnet. I'm using saveRDS to save it, and it runs without error.
saveRDS(ridge, file = 'rnaClassifer_ridgeReg.RDdata')
HOWEVER, I cannot load the object back into RStudio via readRDS; it keeps giving errors and crashes the R session.
readRDS('rnaClassifer_ridgeReg.RDdata')
Note that this is an R object with a size of 161 MB after saving as rnaClassifer_ridgeReg.RDdata (which can be downloaded from here). My local laptop has 8 cores and 32 GB of RAM, which I would think should be enough?
I'm also attaching the dataset (here) used to build the regression model, along with the code. Feel free to run the commands below to generate the R object ridge, and see if you can save it and successfully load it back into R.
library(caret)
library(glmnet)
data.lm.train = read.table('data.txt.gz', header = T, sep = '\t', quote = '', check.names = F)
lambda <- 10^seq(-3, 3, length = 100)
### ridge regression
set.seed(666)
ridge <- train(
  dnaScore ~ ., data = data.lm.train, method = "glmnet",
  trControl = trainControl("cv", number = 10),
  tuneGrid = expand.grid(alpha = 0, lambda = lambda)
)
Any help would be highly appreciated!

Caret train function for multiple data frames as function

There was a similar question to mine over six years ago, and it hasn't been solved (R -- Can I apply the train function in caret to a list of data frames?).
This is why I am bringing up this topic again.
I'm writing my own functions for my big R project at the moment, and I'm wondering whether there is a way to wrap the model training function train() of the caret package so that it works for different data frames with different predictors.
My function should look like this:
lda_ex <- function(data, predictor){
  model <- train(predictor ~ ., data,
                 method = "lda",
                 trControl = trainControl(method = "none"),
                 preProc = c("center", "scale"))
  return(model)
}
Using it afterwards should work like this:
data_iris <- iris
predictor_iris <- "Species"
iris_res <- lda_ex(data = data_iris, predictor = predictor_iris)
Unfortunately, as far as I can tell, an R formula cannot take a variable as input like this.
Is there something I am missing?
Thank you in advance for helping me out!
Solving this would help me a lot in keeping my function sheet clean, and it would certainly save some work.
By writing predictor_iris <- "Species", you are basically saving a string in predictor_iris. Thus, when you run lda_ex, I suspect you run into an error concerning the formula object in train(), since you are trying to predict a string using vectors of covariates.
Indeed, I tried the following toy example:
X = rnorm(1000)
Y = runif(1000)
predictor = "Y"
lm(predictor ~ X)
which gives an error about differences in the lengths of variables.
Let me modify your function:
lda_ex <- function(data, formula){
  model <- train(formula, data,
                 method = "lda",
                 trControl = trainControl(method = "none"),
                 preProc = c("center", "scale"))
  return(model)
}
The key difference is that now we must pass in the whole formula, instead of the predictor only. In that way, we avoid the string-related problem.
library(caret) # Recall to specify the packages needed to reproduce your examples!
data_iris <- iris
formula_iris = Species ~ . # Key difference!
iris_res <- lda_ex(data = data_iris, formula = formula_iris)
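If you would rather keep the string interface from the question, one option (my own sketch, not part of the answer above) is to build the formula inside the function with reformulate(), which turns a response name such as "Species" into Species ~ .:
library(caret)
# Hypothetical variant: build the formula from the predictor string
lda_ex_string <- function(data, predictor) {
  model <- train(reformulate(".", response = predictor), data,
                 method = "lda",
                 trControl = trainControl(method = "none"),
                 preProc = c("center", "scale"))
  return(model)
}
iris_res <- lda_ex_string(data = iris, predictor = "Species")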

Problem creating a Ranger model with R to use for MLflow

I am trying to use MLflow in R. According to https://www.mlflow.org/docs/latest/models.html#r-function-crate, the crate flavor needs to be used for the model. My model uses the Random Forest function implemented in the ranger package:
model <- ranger::ranger(formula = model_formula,
                        data = trainset,
                        importance = "impurity",
                        probability = TRUE,
                        num.trees = 500,
                        mtry = 10)
The model itself works and I can do the prediction on a testset:
test_prediction <- predict(model, testset)
As a next step, I try to bring the model in the crate flavor. I follow here the approach shown in https://docs.databricks.com/_static/notebooks/mlflow/mlflow-quick-start-r.html.
predictor <- crate(function(x) predict(model,.x))
However, this results in an error when I apply the "predictor" to the testset:
predictor(testset)
Error in predict(model, .x) : could not find function "predict"
Does anyone know how to solve this issue? Do I have to pass the prediction function differently to the crate function? Any help is highly appreciated ;-)
In my experience, that Databricks quickstart guide is wrong.
According to the carrier documentation, you need to use explicit namespaces when calling non-base functions inside of crate(). Since predict is actually part of the stats package, you'd need to specify stats::predict. Also, since your crate function depends on the global object named model, you'd need to pass that as an argument to the crate() call as well.
Your code would end up looking something like this (I can't test it on your exact use case, since I don't have your data, but this works for me on MLflow in Databricks):
model <- ranger::ranger(formula = model_formula,
                        data = trainset,
                        importance = "impurity",
                        probability = TRUE,
                        num.trees = 500,
                        mtry = 10)
predictor <- crate(function(x) {
  stats::predict(model, x)
}, model = model)
predictor(testset)
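For completeness: the quickstart linked in the question then logs the crated function with the MLflow R API, roughly as below ("model" is just an illustrative artifact path; adapt it to your tracking setup).
library(mlflow)
library(carrier)
# Log the crated predictor as an MLflow model artifact
mlflow_log_model(predictor, "model")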

What's the difference between lgb.train() and lightgbm() in r?

I'm trying to build a regression model in R using LightGBM, and I'm getting a bit confused about some functions and when/how to use them.
The first is the one in the title: what's the difference between lgb.train() and lightgbm()?
The description in the documentation (https://cran.r-project.org/web/packages/lightgbm/lightgbm.pdf) says that lgb.train is 'Logic to train with LightGBM' and lightgbm is 'Simple interface for training a LightGBM model', while both return an lgb.Booster, a trained model.
One difference I've found is that lgb.train() does not work with valids = , while lightgbm() does.
The second is about the function lgb.cv(), which does cross-validation in LightGBM. How do you apply the output of lgb.cv() to a model?
As I understood from the documentation I've linked above, it seems like the output of both lgb.cv and lgb.train is a model.
Is it correct to use it like the example below?
lgbcv <- lgb.cv(params,
                lgbtrain,
                nrounds = 1000,
                nfold = 5,
                early_stopping_rounds = 100,
                learning_rate = 1.0)
lgbcv <- lightgbm(params,
                  lgbtrain,
                  nrounds = 1000,
                  early_stopping_rounds = 100,
                  learning_rate = 1.0)
Thank you in advance!
what's the difference between lgb.train() and lightgbm()?
These functions both train a LightGBM model, they're just slightly different interfaces. The biggest difference is in how training data are prepared. LightGBM training requires a special LightGBM-specific representation of the training data, called a Dataset. To use lgb.train(), you have to construct one of these beforehand with lgb.Dataset(). lightgbm(), on the other hand, can accept a data frame, data.table, or matrix and will create the Dataset object for you.
Choose whichever method you feel has a more friendly interface...both will produce a single trained LightGBM model (class "lgb.Booster").
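To make the difference concrete, here is a minimal sketch (my example, using the agaricus data bundled with the package):
library(lightgbm)
data(agaricus.train, package = "lightgbm")
train <- agaricus.train
params <- list(objective = "regression", metric = "l2")
# lgb.train() needs a Dataset constructed beforehand
dtrain <- lgb.Dataset(train$data, label = train$label)
model_a <- lgb.train(params = params, data = dtrain, nrounds = 5L)
# lightgbm() accepts the raw matrix and builds the Dataset internally
model_b <- lightgbm(data = train$data, label = train$label,
                    params = params, nrounds = 5L)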
that lgb.train() does not work with valids = , while lightgbm() does.
This is not correct. Both functions accept the keyword argument valids. Run ?lgb.train and ?lightgbm for documentation on those methods.
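For example (a sketch adapted from the package docs, reusing dtrain and params from the snippet above):
data(agaricus.test, package = "lightgbm")
test <- agaricus.test
# Validation data must be wrapped in a Dataset tied to the training one
dvalid <- lgb.Dataset.create.valid(dtrain, test$data, label = test$label)
model <- lgb.train(params = params, data = dtrain, nrounds = 5L,
                   valids = list(valid = dvalid))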
How do you apply the output of lgb.cv() to a model?
I'm not sure what you mean, but you can find an example of how to use lgb.cv() in the docs that show up when you run ?lgb.cv.
library(lightgbm)
data(agaricus.train, package = "lightgbm")
train <- agaricus.train
dtrain <- lgb.Dataset(train$data, label = train$label)
params <- list(objective = "regression", metric = "l2")
model <- lgb.cv(
  params = params
  , data = dtrain
  , nrounds = 5L
  , nfold = 3L
  , min_data = 1L
  , learning_rate = 1.0
)
This returns an object of class "lgb.CVBooster". That object has multiple "lgb.Booster" objects in it (the trained models that lightgbm() or lgb.train() produce).
You can extract any one of these from model$boosters. However, in practice I don't recommend using the models from lgb.cv() directly. The goal of cross-validation is to get an estimate of the generalization error for a model. So you can use lgb.cv() to figure out the expected error for a given dataset + set of parameters (by looking at model$record_evals and model$best_score).
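Continuing from the lgb.cv() call above, a short sketch of what I mean (field names as documented for "lgb.CVBooster"):
model$best_score                      # best mean metric across folds
model$record_evals$valid$l2$eval      # per-iteration mean l2 across folds
fold_model <- model$boosters[[1L]]$booster  # one fold's lgb.Booster, if you really need it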

R bsts predictions are not consistent

Whenever I run the predict function multiple times on a bsts model using the same prediction data, I get different answers. So my question is, is there a way to return consistent answers given I keep my predictor dataset the same?
Example using the iris data set (I know it's not time series but it will illustrate my point)
library(bsts)
iris_train <- iris[1:100, 1:3]
iris_test <- iris[101:150, 1:3]
ss <- AddLocalLinearTrend(list(), y = iris_train$Sepal.Length)
iris_bsts <- bsts(formula = Sepal.Length ~ ., data = iris_train,
                  state.specification = ss,
                  family = 'gaussian', seed = 1, niter = 500)
burn <- SuggestBurn(0.1,iris_bsts)
Now if I run this following line say, 10 times, each result is different:
iris_predict <- predict(iris_bsts, newdata = iris_test, burn = burn)
iris_predict$mean
I understand that it is running MCMC simulations, but I require consistent results and have therefore tried:
Setting the seed in bsts and before predict
Setting the state space standard deviation to near 0, which just creates unstable results.
Neither seems to work. Any help would be appreciated!
I encountered the same problem. To fix it, you need to set the random seed in the embedded C code. I forked the package and made the modifications here: BSTS.
For package installation only, download bsts_0.7.1.1.tar.gz from the build folder. If you already have bsts installed, replace it with this version via:
remove.packages("bsts")
# assumes the working directory is where the file is located
install.packages("bsts_0.7.1.1.tar.gz", repos = NULL, type = "source")
If you do not have bsts installed, please install it first to ensure all dependencies are there. (This may require installing Rtools, Boom, and BoomSpikeSlab individually.)
This package version only modifies the predict function from bsts; all other code should work as is. It automatically sets the random seed to 1 each time predict is called. If you want predictions to vary, you'll need to explicitly set the seed argument each time.
You can make a function that sets the seed each time (set.seed was unnecessary...):
reproducible_predict <- function(S) {
  iris_bsts <- bsts(formula = Sepal.Length ~ ., data = iris_train,
                    state.specification = ss, seed = S,
                    family = 'gaussian', niter = 500)
  burn <- SuggestBurn(0.1, iris_bsts)
  iris_predict <- predict(iris_bsts, newdata = iris_test, burn = burn)
  return(iris_predict$mean)
}
reproducible_predict(1)
[1] 7.043592 6.212780 6.789205 6.563942 6.746156
reproducible_predict(1)
[1] 7.043592 6.212780 6.789205 6.563942 6.746156
reproducible_predict(200)
[1] 7.013679 6.173846 6.763944 6.567651 6.715257
reproducible_predict(200)
[1] 7.013679 6.173846 6.763944 6.567651 6.715257
I have come across the same issue.
The problem comes from setting the seed within the model definition only.
To solve your problem, you have to set a seed within the predict function such as:
iris_predict <- predict(iris_bsts, newdata = iris_test, burn = burn, seed=X)
Hope this helps.
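As a quick check (assuming iris_bsts, iris_test, and burn from the question above), fixing the seed makes the draws repeatable:
p1 <- predict(iris_bsts, newdata = iris_test, burn = burn, seed = 1)
p2 <- predict(iris_bsts, newdata = iris_test, burn = burn, seed = 1)
identical(p1$mean, p2$mean)  # TRUE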
