how to save a fitted R model for later use

Sorry for this novice question: if I fit an lm() or loess() model and save it to a file or a database for later use by a third party with the predict() method, do I have to save the entire model object? Since the returned model object contains the original raw data, it can be huge.

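If you include the argument model = FALSE when fitting the model, the model frame that was used will be excluded from the resulting object. For lm() this argument is TRUE by default; loess() already defaults to model = FALSE. Note that an lm object still keeps other components whose size grows with the number of rows (residuals, fitted values, the QR decomposition), so this reduces the size rather than making it independent of the data. You can get an estimate of the memory used to store the model object with: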
object.size(my_model)
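A minimal sketch of the suggestion above, assuming simulated data and illustrative object names (d, fit_full, fit_small are not from the original question). It compares the two object sizes and checks that a model fitted with model = FALSE can still be saved, reloaded, and used with predict() on new data:

```r
# Simulated data; all names here are illustrative only.
set.seed(1)
d <- data.frame(x = rnorm(1000))
d$y <- 2 * d$x + rnorm(1000)

fit_full  <- lm(y ~ x, data = d)                 # model = TRUE is the lm() default
fit_small <- lm(y ~ x, data = d, model = FALSE)  # model frame is not stored

# Compare the estimated memory footprint of the two objects.
object.size(fit_full)
object.size(fit_small)

# Save the smaller object to disk, reload it, and predict on new data.
path <- tempfile(fileext = ".rds")
saveRDS(fit_small, path)
fit_loaded <- readRDS(path)

new_d <- data.frame(x = c(-1, 0, 1))
pred_full   <- predict(fit_full, newdata = new_d)
pred_loaded <- predict(fit_loaded, newdata = new_d)
```

The predictions from the reloaded object should match those from the full object, since predict() with newdata only needs the terms and coefficients.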

Related

R save xgb model command error: 'model must be xgb.Booster'

'bst' is the name of an xgboost model that I built in R. It gives me predicted values for the test dataset using this code. So it is definitely an xgboost model.
pred.xgb <- predict(bst , xdtest) # get prediction in test sample
cor(ytestdata, pred.xgb)
Now, I would like to save the model so another can use the model with their data set which has the same predictor variables and the same variable to be predicted.
Consistent with page 4 of xgboost.pdf, the documentation for the xgboost package, I use the xgb.save command:
xgb.save(bst, 'xgb.model')
which produces the error:
Error in xgb.save(bst, "xgb.model") : model must be xgb.Booster.
Any insight would be appreciated. I searched the stack overflow and could not locate relevant advice.
Mike

It's hard to know exactly what's going on without a fully reproducible example. But just because your model can make predictions on the test data, it doesn't mean it's an xgboost model. It can be any type of model with a predict method.
You can try class(bst) to see the class of your bst object. It should return "xgb.Booster", though I suspect it won't here (hence your error).
On another note, if you want to pass your model to another person using R, you can just save the R object rather than exporting to binary, via:
save(bst, file = "model.RData")
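The two checks above can be sketched as follows. To keep the example runnable without xgboost installed, bst here is a stand-in lm fit (an assumption for illustration, not the asker's object), so the xgb.save() branch is skipped; with a genuine booster it would be taken instead:

```r
# Stand-in object: an lm fit, used only so this sketch runs without xgboost.
bst <- lm(mpg ~ wt, data = mtcars)

# Inspect what kind of object bst really is.
class(bst)

if (inherits(bst, "xgb.Booster")) {
  # Only reached for a genuine booster; requires the xgboost package.
  xgboost::xgb.save(bst, "xgb.model")
}

# Saving the R object works for any class; the file name must be a
# quoted string passed through the named 'file' argument.
rdata_path <- file.path(tempdir(), "model.RData")
save(bst, file = rdata_path)

# load() restores the object under its original name; a fresh environment
# is used here to avoid overwriting anything in the workspace.
restored <- new.env()
load(rdata_path, envir = restored)
class(restored$bst)
```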

the meaning of model output number in R for linear regression model

Do I understand correctly that the value (model output) retrieved from evaluate_model() for the linear regression model is the RMSE?

No, the column output, model_output, is not the root mean square error (RMSE). It is the prediction value for your model.
So it appears that the evaluate_model() function is from the statisticalModeling package.
According to the documentation, this function will "Evaluate a model for specified inputs", and it is described as follows:
Find the model outputs for specified inputs. This is equivalent to the generic predict() function, except it will choose sensible values by default. This simplifies getting a quick look at model values.
In other words, this function, evaluate_model(), takes inputs and shows you the "prediction" of your input data using the model.
In your case, evaluate_model() takes each row of your data, uses the column/variable values in that row (age, sex, coverage), and computes what the model estimates the dependent variable to be. It does not refit the model, and it does not summarize the prediction error.
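To make the distinction concrete, here is a small sketch using base R's predict(), which the documentation quoted above describes as equivalent. The mtcars data and the model formula are assumptions chosen for illustration, not the asker's data:

```r
# Illustrative model on a built-in data set (not the asker's data).
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Model output: one predicted value per input row.
model_output <- predict(fit, newdata = mtcars)
length(model_output)

# RMSE: a single number summarizing how far predictions are from
# the observed values; it has to be computed separately.
rmse <- sqrt(mean((mtcars$mpg - model_output)^2))
rmse
```

The model output has as many values as there are rows, whereas the RMSE is one summary number for the whole data set.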

Is there a package that allows online updating of glm with new data?

Is there a package that allows updating of a logistic regression with new data, without retraining the model on all data points again?
For example, say I have fitted a model glm(y~., data=X). Then some time later I receive more data (more rows) Y. I need a model that is trained on rbind(X,Y). But instead of re-training the model on this new combined dataset, it would be good if the model could simply update itself using Y (using either a Bayesian or a frequentist method).
The reason I am seeking an update method is that over time the dataset will grow to be huge, so re-training will become increasingly computationally infeasible.

Used saveRDS to save a model but not enough memory to readRDS?

I created a model based on a very large dataset and had the program save the results using
saveRDS(featVarLogReg.mod, file="featVarLogReg.mod.RDS")
Now I'm trying to load the model to evaluate, but readRDS runs out of memory.
featVarLR.mod <- readRDS(file = "featVarLogReg.mod.RDS")
Is there a way to load the file that takes less memory? Or at least the same amount of memory that was used to save it?
The RDS file ended up being 1.5GB in size for this logistic regression using caret. My other models using the same dataset and very similar caret models were 50MB in size so I can load them.

By default, caret keeps a copy of the training data in the model object. You could try setting returnData = FALSE in the trainControl() object that you pass to train(). I don't recall if this fixed my issue in the past.
https://www.rdocumentation.org/packages/caret/versions/6.0-77/topics/trainControl
You could also try to just export the coefficients into a dataframe and use a manual formula to score new data.
Use coef(model_object)
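A minimal sketch of the coefficient approach for a logistic regression, assuming a plain glm() fit on the built-in mtcars data (the asker's caret model and variables are not available, so the names here are illustrative). The manual score is the inverse logit of the linear predictor:

```r
# Illustrative logistic regression (stand-in for the asker's model).
fit <- glm(am ~ wt + hp, data = mtcars, family = binomial)

# Export the coefficients; this small named vector is all that needs saving.
beta <- coef(fit)

# New data to score; columns must be in the same order as the coefficients.
new_data <- mtcars[1:5, c("wt", "hp")]
X <- cbind(1, as.matrix(new_data))  # leading column of 1s for the intercept

# Manual formula: linear predictor, then inverse logit.
manual <- as.vector(plogis(X %*% beta))

# Reference values from predict(), for comparison only.
reference <- as.vector(predict(fit, newdata = new_data, type = "response"))
all.equal(manual, reference)
```

This only works directly for models whose predictions are a simple function of the coefficients; factor predictors or preprocessing steps would need to be reproduced by hand.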

R randomForest to PMML class index is wrong

I'm exporting an R randomForest model to PMML. The resulting PMML always has the class as the first element of the DataDictionary element, which is not always true.
Is there some way to fix this, or at least to augment the PMML with custom Extension elements? That way I could put the class index there.
I've looked in the pmml package documentation, as well as in the pmmlTransformations packages, but couldn't find anything there that could help me solve this issue.

By PMML class I assume you mean the model type (classification vs regression) in the PMML model attributes?
If so, it is not true that the model type is determined from the data type of the first element of the DataDictionary; the two are completely independent. The model type is whatever R thinks it is: the R random forest object records its own type (model$type), and that is the model type exported by the pmml function. If you want your model to be a certain type, just make sure you let R know that. For example, if you are using the iris data set and your predicted variable is Sepal.Length, R will correctly assume it is a regression model. If you insist on treating it as a classification model, try using as.factor(Sepal.Length) instead.