Reasonably new to this so sorry if I'm being thick.
Is there a way to pass existing models to caretEnsemble?
I have several models, run on the same training data, that I would like to ensemble with caretEnsemble.
Each model takes several hours to run, so I save them, then reload them when needed rather than re-run.
model_xgb <- train(oi_in_4_24_months~., method="xgbTree", data=training, trControl=train_control)
saveRDS(model_xgb, "model_xgb.rds")
model_logit <- train(oi_in_4_24_months~., method="LogitBoost", data=training, trControl=train_control)
saveRDS(model_logit, "model_logit.rds")
model_xgb <- readRDS("model_xgb.rds")
model_logit <- readRDS("model_logit.rds")
I want to pass these saved models to caretEnsemble, but as far as I can make out I can only pass a list of model types, e.g. "LogitBoost", "xgbTree", and caretEnsemble will both run the initial models, then ensemble them.
Is there a way to pass existing models, trained on the same data, to caretEnsemble?
The package author has an example script (https://gist.github.com/zachmayer/5152157) that suggests the following:
all_models <- list(model_xgb, model_logit)
names(all_models) <- sapply(all_models, function(x) x$method)
greedy <- caretEnsemble(all_models, iter=1000L)
But that produces an error
"Error: is(list_of_models, "caretList") is not TRUE".
I think that use of caretList previously wasn't compulsory, but now is.
I don't think you still need the solution to this but answering for anyone else that has the same question.
You can add models to be used by caretEnsemble or caretStack by using as.caretList(list(rpart2 = model_list1, gbm = model_list2))
But remember to use the same indexes for cross-validation/bootstrapping. 'If the indexes were different (or some stuff were not stored as "not/wrongly" specified in trainControl), it will throw an error when trying to use caretEnsemble or caretStack. Which is the expected behavior, obviously.' This issue on github has very clear and simple instructions.
Related
'bst' is the name of an xgboost model that I built in R. It gives me predicted values for the test dataset using this code. So it is definitely an xgboost model.
pred.xgb <- predict(bst , xdtest) # get prediction in test sample
cor(ytestdata, pred.xgb)
Now, I would like to save the model so another can use the model with their data set which has the same predictor variables and the same variable to be predicted.
Consistent with page 4 of xgboost.pdf, the documentation for the xgboost package, I use the xgb.save command:
xgb.save(bst, 'xgb.model')
which produces the error:
Error in xgb.save(bst, "xgb.model") : model must be xgb.Booster.
Any insight would be appreciated. I searched the stack overflow and could not locate relevant advice.
Mike
It's hard to know exactly what's going on without a fully reproducible example. But just because your model can make predictions on the test data, it doesn't mean it's an xgboost model. It can be any type of model with a predict method.
You can try class(bst) to see the class of your bst object. It should return "xgb.Booster," though I suspect it won't here (hence your error).
On another note, if you want to pass your model to another person using R, you can just save the R object rather than exporting to binary, via:
save(bst, model.RData)
I am trying to apply cross validation to a list of linear models and getting an error.
Here is my code:
library(MASS)
library(boot)
glm.fits = lapply(1:10,function(d) glm(nox~poly(dis,d),data=Boston))
cvs = lapply(1:10,function(i) cv.glm(Boston,glm.fits[[i]],K=10)$delta[1])
I get the error:
Error in poly(dis, d) : object 'd' not found
I then tried the following code:
library(MASS)
library(boot)
cvs=rep(0,10)
for (d in 1:10){
glmfit = glm(nox~poly(dis,d),data=Boston)
cvs[d] = cv.glm(Boston,glmfit,K=10)$delta[1]
}
and this worked.
Can anyone explain why my first attempt did not work, and suggest a fix?
Also, assuming a fix to the first attempt can be obtained, which way of writing code is better practice? (assume that I want a list of the various fits and that I would edit the latter code to preserve them) To me, the first attempt is more elegant.
In order for your first attempt to work, cv.glm (and maybe glm) would have to be written differently to take much more care about where evaluations are taking place.
The function cv.glm basically re-evaluates the model formula a bunch of times. It takes that model formula directly from the fitted glm object. So put yourself in R's shoes (as it were), and consider you're deep in the function cv.glm and you've been instructed to refit this model:
glm(formula = nox ~ poly(dis, d), data = Boston)
The fitted glm object has Boston in it, and glm knows to look first in Boston for variables, so it finds nox and dis easily. But where is d? It's not in Boston. It's not in the fitted glm object (and glm wouldn't know to look there anyway). In fact, it isn't anywhere. That d value existed only in the context of the lapply iterations and then disappeared.
In the second case, since d is currently an active variable in your for loop, after R fails to find d in the data frame Boston, it looks in the parent frame, in this case the global environment and finds your for loop index d and merrily keeps going.
If you need to use glm and cv.glm in this way I would just use the for loop; it might be possible to work around the evaluation issues, but it probably wouldn't be worth the time and hassle.
I am posting this because this postfeture selection in caret hasent helped my issue and I have 2 questions regarding feature selection function in caret package
when I run code below on my matrix of gene expression allsamplecombat with 5 classes defined in y= :
control <- rfeControl(functions=rfFuncs, method="cv", number=10)
results <- rfe(t(allsamplecombat[filter,]), y = factor(info$clust), sizes=c(300,400,500,600,700,800,1000,1200), rfeControl=control)
I get an out put like this
So, I want to know if I can extract top features for each classes, because predictors(results) just give me the resulting feature without indicating importance for each classes.
my second problem is that when i try to change rfeControl functions to treebagFuncs and run 'parRF` method
control <- rfeControl(functions=treebagFuncs, method="cv", number=5)
results <- rfe(t(allsamplecombat[filter,]), y = factor(info$clust), sizes=c(400,500,600,700,800), rfeControl=control, method="parRF")
i get Error in { : task 1 failed - "subscript out of bounds" error.
what is wrong in my code?
For the importances, there is a sub-object called variables that contains this information for each step of the elimination.
treebagFuncs is designed to work with ipred's bagging function and isn't related to random forest.
You would probably used caretFuncs and pass method to that. However, if you are going to parallelize something, do it to the resampling loop and not the model function. This is generally more efficient. Note that if you do both with M workers, you might actually get M^3 (one for rfe, one for train, and one for parRF). There are options in rfe and train to turn their parallelism off.
stackoverflowers , I need some help from tensorflow experts. Actually I've buid a multi-layer perceptron, trained it, tested it and everything seemed ok. However, When I restored the model and tried to use it again, its accuracy does not correspond to the trained model and the predictions are pretty different from the real labels. The code I am using for the restoring - prediction is the following : (I'm using R)
pred <- multiLayerPerceptron(test_data)
init <- tf$global_variables_initializer()
with(tf$Session() %as% sess, {
sess$run(init)
model_saver$restore(sess, "log_files/model_MLP1")
test_pred_1 <- sess$run(pred, feed_dict= dict(x = test_data))
})
Is everything Ok with the code ? FYI I wanted by this part of the code to get the predictions of my model for test_data.
Your code does not show where model_saver is initialized, but it should be created after you create the computational graph. If not, it does not know which variables to restore/save. So create your model_saver after pred <- multiLayerPerceptron(test_data).
Note that, if you made the same mistake during training, your checkpoint will be empty and you will need to retrain your model first.
Today while utilizing the stepAIC function to narrow down a list of variables, I had a thought about how the function was exactly working. Consider this code:
for (i in 1:5) {
model <- lm(train_y ~ ., data = train[fold != i,])
model <- stepAIC(model)
}
In this example, each iteration is subsetting the training dataset based on the fold variable. When the stepAIC executes, will it 'honor' the row filtering that was originally applied, or will it use the entire 'train' dataset.
I see the documentation says this: "The model fitting must apply the models to the same dataset." , but I dont know if it is just using the same dataset name, or it uses the same filtering as well.
I know there are lots of discussions about the correct way to select features via lasso, stepwise, etc. , so im just interested here is how the the function is working. I am aware there are likely better ways to select variables.
Thanks