Predicting from an already fitted model in an R package

I'm working on an R package where I need to call predict.lm on a model I've already fit. I've saved the linear model as a file which I can put in the data folder of the package. I'm worried about slowing things down if I load the model every time the function is called. The function that uses this model is the meat of the package and gets called on every iteration of a simulation, so I'd prefer to read the saved model once when the package is loaded. Is there a way to do that?
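For the load-once part of the question, here is a minimal sketch of one common approach (the file name, the inst/extdata/ location, and the fast_predict() wrapper are illustrative assumptions, not part of the question): cache the fitted model in a package-local environment via the .onLoad namespace hook, so the file is read exactly once per session.

# In R/zzz.R of the package:
.pkg_cache <- new.env(parent = emptyenv())

.onLoad <- function(libname, pkgname) {
  # "fit.rds" shipped under inst/extdata/ is an illustrative assumption
  path <- system.file("extdata", "fit.rds", package = pkgname)
  .pkg_cache$fit <- readRDS(path)
}

# The hot function then reuses the cached model on every call:
fast_predict <- function(newdata) {
  predict(.pkg_cache$fit, newdata = newdata)
}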

Why not just save the coefficients and then "predict" with them?
c.vec <- coef(fit)                    # intercept first, then the slope terms
Yhat  <- sum(c.vec * c(1, data.vec))  # dot product, not elementwise: one fitted value
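A quick way to convince yourself the manual prediction agrees with predict.lm, using mtcars as a stand-in for the real model:

fit <- lm(mpg ~ wt + hp, data = mtcars)
c.vec <- coef(fit)                                    # (Intercept), wt, hp
sum(c.vec * c(1, 3, 120))                             # manual prediction for wt = 3, hp = 120
predict(fit, newdata = data.frame(wt = 3, hp = 120))  # same number from predict.lm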

Related

Why does tab_model (sjPlot) re-run MCMC with rstanarm model?

I am creating a table with tab_model from the sjPlot package (https://cran.r-project.org/web/packages/sjPlot/vignettes/tab_model_estimates.html).
However, when I use a negative binomial rstanarm model object, tab_model re-runs the MCMC chains.
My actual model takes many hours to run, so it is not ideal for tab_model to be doing this; it doesn't seem to happen for other model types (such as glmer from lme4).
library(rstanarm)
library(lme4)

dat.nb <- data.frame(x = rnorm(200),
                     z = rep(c("A", "B", "C", "D"), 50),
                     y = rnbinom(200, size = 1, prob = 0.5))

mod1 <- glmer.nb(y ~ x + (1 | z), data = dat.nb)

options(mc.cores = parallel::detectCores())
mod2 <- stan_glmer.nb(y ~ x + (1 | z), data = dat.nb)
Now to create the model tables:
library(sjPlot)
tab_model(mod1)
The output is quick and as expected (although the original model also ran quickly, so it is possible that tab_model is re-running the model here too).
Now when I try
tab_model(mod2)
It begins re-running MCMC. Is this normal behavior, and if so, is anyone familiar with a way to turn it off and just use the model object already created, rather than re-running the model?
tl;dr I think this is going to be hard to avoid without hacking both the insight package and this one, or asking the package maintainer for an edit, unless you want to forgo printing the ICC, R^2, and the random-effects variance.
Here, tab_model() calls insight::get_variance(), which tries to compute variances for the null model so that it can compute the ICC and R^2; computing these variances requires re-fitting the model. (When it does this for the glmer.nb fit, it goes via lme4:::update.merMod() and is quick enough that you don't notice the computation time.)
So
tab_model(mod2,show.r2=FALSE,show.icc=FALSE,show.re.var=FALSE)
doesn't recompute anything. In theory I think it should be possible to skip the resampling/recomputation step with just show.r2=FALSE, show.icc=FALSE (i.e. it shouldn't be necessary to get the RE var), but this would take some hacking/participation by the maintainer.
Digging in (by using debug(rstan::sampling) to stop inside the Stan sampling function, then where to see the call stack):
- tab_model() calls insight::get_variance()
- the insight::get_variance.stanreg() method calls insight:::.compute_variances()
- ... which calls insight:::.compute_variance_distribution()
- ... which (for a log-link count distribution) calls insight:::.variance_distributional()
- ... which calls null_model()
- ... which calls .null_model_mixed()
- ... which calls stats::update()
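For anyone who wants to reproduce the trace, the pattern is plain base R debugging, nothing sjPlot-specific:

debug(rstan::sampling)   # break whenever Stan's sampler is entered
tab_model(mod2)          # execution stops inside rstan::sampling()
# at the Browse[]> prompt, type `where` to print the call stack above
undebug(rstan::sampling)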

Predicting with zoib models (MCMC / RJags)

I am using the zoib package in R to build zero-inflated beta regression models. I am looking for a simple way to use the models that zoib produces to calculate a predicted response for a new dataset. By "new dataset" I mean data not used to build the original zoib models.
I know I can just take the zoib model parameters and manually write a function in R to predict with, but I want to utilise the fact that zoib models are Bayesian, so I can get a posterior distribution of possible response values. My plan is to use the posterior distributions to calculate credible intervals around each prediction.
Because zoib uses an MCMC approach within RJags, I have investigated these two solutions:
- manipulating the code within RJags
- appending the new data with "NA" response values
I don't know how to implement the first solution because zoib runs RJags internally, and the zero-inflated model it specifies is very complicated. I tried the second solution, but zoib just ignored the rows of data that I appended with "NA" response values.
I emailed the zoib package developers, and this was their response:
For now, the zoib function can only output posterior predictive samples for Y given the X in the data set where the zoib regression is applied to, but not for a new set of X's. Your suggestion can be easily incorporated into the new version of the package, which is expected to be out in about a few weeks.
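Until that version lands, the manual route mentioned in the question can be written directly against the posterior draws. A rough sketch for the beta-mean component only (the fit$coeff layout and the object names are illustrative assumptions, not zoib's documented API, and the zero/one-inflation components are ignored for brevity):

post  <- as.matrix(fit$coeff)           # hypothetical: draws x coefficients
X.new <- cbind(1, as.matrix(newdata))   # design matrix for the new data
eta   <- post %*% t(X.new)              # linear predictor, one row per draw
mu    <- plogis(eta)                    # inverse logit: mean of the beta part
# posterior median and 95% credible interval for each new observation:
apply(mu, 2, quantile, probs = c(0.025, 0.5, 0.975))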

How to get MLP predictions in TensorFlow from a restored model?

Stack Overflowers, I need some help from TensorFlow experts. I've built a multi-layer perceptron, trained it, and tested it, and everything seemed OK. However, when I restored the model and tried to use it again, its accuracy did not correspond to the trained model, and the predictions were pretty different from the real labels. The code I am using for the restore/predict step is the following (I'm using R):
pred <- multiLayerPerceptron(test_data)
init <- tf$global_variables_initializer()
with(tf$Session() %as% sess, {
  sess$run(init)
  model_saver$restore(sess, "log_files/model_MLP1")
  test_pred_1 <- sess$run(pred, feed_dict = dict(x = test_data))
})
Is everything OK with the code? To be clear, I intended this part of the code to get the model's predictions for test_data.
Your code does not show where model_saver is initialized, but it must be created after you create the computational graph; otherwise it does not know which variables to save/restore. So create your model_saver after pred <- multiLayerPerceptron(test_data).
Note that, if you made the same mistake during training, your checkpoint will be empty and you will need to retrain your model first.
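A sketch of the suggested ordering, reusing the question's identifiers (note that restoring a checkpoint replaces the variables' values, so running the initializer is no longer needed):

pred <- multiLayerPerceptron(test_data)
model_saver <- tf$train$Saver()   # created after the graph, so it sees the variables

with(tf$Session() %as% sess, {
  model_saver$restore(sess, "log_files/model_MLP1")   # restore instead of init
  test_pred_1 <- sess$run(pred, feed_dict = dict(x = test_data))
})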

How to pass saved models to caretEnsemble

Reasonably new to this so sorry if I'm being thick.
Is there a way to pass existing models to caretEnsemble?
I have several models, run on the same training data, that I would like to ensemble with caretEnsemble.
Each model takes several hours to run, so I save them, then reload them when needed rather than re-run.
model_xgb <- train(oi_in_4_24_months ~ ., method = "xgbTree",
                   data = training, trControl = train_control)
saveRDS(model_xgb, "model_xgb.rds")
model_logit <- train(oi_in_4_24_months ~ ., method = "LogitBoost",
                     data = training, trControl = train_control)
saveRDS(model_logit, "model_logit.rds")
model_xgb <- readRDS("model_xgb.rds")
model_logit <- readRDS("model_logit.rds")
I want to pass these saved models to caretEnsemble, but as far as I can make out I can only pass a list of model types, e.g. "LogitBoost", "xgbTree", and caretEnsemble will both run the initial models, then ensemble them.
Is there a way to pass existing models, trained on the same data, to caretEnsemble?
The package author has an example script (https://gist.github.com/zachmayer/5152157) that suggests the following:
all_models <- list(model_xgb, model_logit)
names(all_models) <- sapply(all_models, function(x) x$method)
greedy <- caretEnsemble(all_models, iter=1000L)
But that produces an error:
Error: is(list_of_models, "caretList") is not TRUE
I think use of caretList wasn't previously compulsory, but it now is.
I don't suppose you still need the solution to this, but I'm answering for anyone else who has the same question.
You can pass existing models to caretEnsemble or caretStack by using as.caretList(list(rpart2 = model_list1, gbm = model_list2)).
But remember to use the same resampling indexes for cross-validation/bootstrapping: "If the indexes were different (or some stuff were not stored, as not/wrongly specified in trainControl), it will throw an error when trying to use caretEnsemble or caretStack. Which is the expected behavior, obviously." This issue on GitHub has very clear and simple instructions.
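Applied to the question's saved models, that looks roughly like this (a sketch; it assumes both models are classification fits trained with identical resampling indexes, which you can pin down via trainControl(index = ...) before training):

library(caret)
library(caretEnsemble)

# Fix the folds up front so every model resamples identically:
idx <- createFolds(training$oi_in_4_24_months, k = 5, returnTrain = TRUE)
train_control <- trainControl(method = "cv", index = idx,
                              savePredictions = "final", classProbs = TRUE)

# ... train and saveRDS() the models as in the question, then later:
all_models <- as.caretList(list(xgbTree    = readRDS("model_xgb.rds"),
                                LogitBoost = readRDS("model_logit.rds")))
greedy <- caretEnsemble(all_models)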

Test for Multicollinearity in Panel Data R

I am running a panel data regression using the plm package in R and want to check for multicollinearity between the explanatory variables.
I know there is the vif() function in the car package; however, as far as I know, it cannot deal with panel data output.
The plm package offers other diagnostics, such as unit root tests, but I found no method to test for multicollinearity.
Is there a way to calculate a test similar to vif, or can I just regard each variable as a time series, leaving out the panel information, and run the tests using the car package?
I cannot disclose the data, but the problem should be relevant to all panel data models.
The dimension is roughly 1,000 observations, over 50 time-periods.
The code I use looks like this:
pdata <- plm.data(RegData, index = c("id", "time"))
fixed <- plm(Y ~ X, data = pdata, model = "within")
and then
vif(fixed)
returns an error.
Thank you in advance.
This question has been asked with reference to other statistical packages, such as SAS (https://communities.sas.com/thread/47675) and Stata (http://www.stata.com/statalist/archive/2005-08/msg00018.html), and the common answer has been to use a pooled model to get the VIF. The logic is that since multicollinearity is only about the independent variables, there is no need to control for individual effects using panel methods.
Here's some code extracted from another site:
mydata <- read.csv("US Panel Data.csv")
attach(mydata)     # not sure if that's really needed
Y <- cbind(Return) # not sure what that is doing
pdata <- plm.data(mydata, index = c("id", "t"))
model <- plm(Y ~ 1 + ESG + Beta + Market.Cap + PTBV + Momentum +
               Dummy1 + Dummy2 + Dummy3 + Dummy4 + Dummy5 +
               Dummy6 + Dummy7 + Dummy8 + Dummy9,
             data = pdata, model = "pooling")
vif(model)
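If you would rather not depend on car::vif() accepting a plm object, the same quantity can be computed by hand from the pooled design matrix, since VIF_j = 1 / (1 - R_j^2) is the j-th diagonal element of the inverse correlation matrix of the regressors (a sketch using a subset of the question's variables):

X <- model.matrix(~ ESG + Beta + Market.Cap + PTBV + Momentum,
                  data = mydata)[, -1]   # regressors only, intercept dropped
vifs <- diag(solve(cor(X)))              # VIF_j = 1 / (1 - R_j^2)
vifs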
