I'm exporting an R randomForest model to PMML. The resulting PMML always has the class as the first element of the DataDictionary element, which is not always true for my data.
Is there some way to fix this, or at least to augment the PMML with custom Extension elements? That way I could put the class index there.
I've looked in the pmml package documentation, as well as in the pmmlTransformations package, but couldn't find anything there that could help me solve this issue.
By PMML class I assume you mean the model type (classification vs regression) in the PMML model attributes?
If so, it is not true that the model type is determined from the data type of the first element of the DataDictionary; the two are completely independent. The model type is determined by the type R thinks the model is: the random forest object records its own type in model$type, and that is the model type exported by the pmml function. If you want your model to be a certain type, make sure you let R know. For example, with the iris data set, if your predicted variable is Sepal.Length, R will correctly assume it is a regression model; if you insist on treating it as a classification model, try using as.factor(Sepal.Length) instead.
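A minimal sketch of the two cases, assuming the iris data set mentioned above (the object names are mine):

```r
library(randomForest)
library(pmml)

# Numeric target -> R builds a regression forest
rf_reg <- randomForest(Sepal.Length ~ ., data = iris)
rf_reg$type  # "regression"

# Wrapping the target in as.factor() forces a classification forest
iris_cls <- transform(iris, Sepal.Length = as.factor(Sepal.Length))
rf_cls <- randomForest(Sepal.Length ~ ., data = iris_cls)
rf_cls$type  # "classification"

# pmml() exports whichever type the R object reports
pmml(rf_cls)
```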
'bst' is the name of an xgboost model that I built in R. It gives me predicted values for the test dataset using the code below, so it is definitely an xgboost model.
pred.xgb <- predict(bst , xdtest) # get prediction in test sample
cor(ytestdata, pred.xgb)
Now, I would like to save the model so another can use the model with their data set which has the same predictor variables and the same variable to be predicted.
Consistent with page 4 of xgboost.pdf, the documentation for the xgboost package, I use the xgb.save command:
xgb.save(bst, 'xgb.model')
which produces the error:
Error in xgb.save(bst, "xgb.model") : model must be xgb.Booster.
Any insight would be appreciated. I searched Stack Overflow and could not locate relevant advice.
Mike
It's hard to know exactly what's going on without a fully reproducible example. But just because your model can make predictions on the test data, it doesn't mean it's an xgboost model. It can be any type of model with a predict method.
You can try class(bst) to see the class of your bst object. It should return "xgb.Booster", though I suspect it won't here (hence your error).
On another note, if you want to pass your model to another person using R, you can just save the R object itself rather than exporting to xgboost's binary format, via:
save(bst, file = "bst.RData")
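A sketch of the full round trip (the object and file names are assumed from the question; note that save() needs the file = argument):

```r
# On your machine:
save(bst, file = "bst.RData")

# On the recipient's machine:
load("bst.RData")                 # recreates `bst` in the workspace
pred.xgb <- predict(bst, xdtest)  # same predictor columns required
```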
In R, the documentation of the plm function in package plm describes the possibility of choosing one of three effects: individual, time, twoways. Why does this exist if I can just pick the model type, which already specifies which effect to use? E.g. a 'within' model will only use individual, and random will always pick twoways. Moreover, a pooling model by definition takes no effect (no time and no individual), so choosing an effect in that case is meaningless. What's the purpose of this additional input?
How do you come to this conclusion? The within model can be used with "individual", "time", or "twoways" effects. You should see different results for your model coefficients when choosing a different effect. Also, when you use "time" or "twoways", you can get the specific time effects via
summary(fixef(yourmodel, type = "level", effect = "time")).
(My plm package version is 2.2-4.)
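A sketch with the Grunfeld data that ships with plm, showing that the same within model gives different coefficients under different effect choices (the object names are mine):

```r
library(plm)
data("Grunfeld", package = "plm")

m_ind <- plm(inv ~ value + capital, data = Grunfeld,
             model = "within", effect = "individual")
m_two <- plm(inv ~ value + capital, data = Grunfeld,
             model = "within", effect = "twoways")

coef(m_ind)  # individual effects swept out
coef(m_two)  # individual AND time effects swept out; differs from m_ind

# Time effects are only available when the effect includes "time"
summary(fixef(m_two, type = "level", effect = "time"))
```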
I'm a newbie in R, and have a question regarding the "tree" package.
I have created a classification model with the package, and now I want to try prediction. But I have no idea how to do the prediction, or how to get the class labels.
All I've done so far is create the model with my training and test sets and figure out its accuracy. But is there a way to do the actual prediction?
Like many models in R, you can use the predict function on new data points to get predictions for them, as well as class labels. More specifically, for a tree object there is a dedicated help page (?predict.tree) describing its optional arguments.
In general, to get predictions on new data you can use this command
predict(your_tree_model, new_data)
and to get class labels
predict(your_tree_model, new_data, type = "class")
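Putting it together, a self-contained sketch with the tree package and the iris data (the train/test split and object names are mine):

```r
library(tree)

set.seed(1)
train_idx <- sample(nrow(iris), 100)
fit <- tree(Species ~ ., data = iris[train_idx, ])

# Per-class probabilities for the held-out rows (the default)
probs <- predict(fit, iris[-train_idx, ])

# Hard class labels
labels <- predict(fit, iris[-train_idx, ], type = "class")
table(labels, iris$Species[-train_idx])  # confusion matrix
```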
While trying to export an R classifier to PMML, using the pmml package, I noticed that the class distribution for a node in the tree is not exported.
PMML supports this with the ScoreDistribution element: http://www.dmg.org/v1-1/treemodel.html
Is there any way to have this information in the PMML? I want to read the PMML with another tool that depends on it.
I'm doing something like:
library(randomForest)
library(pmml)
iris.rf <- randomForest(Species ~ ., data=iris, importance=TRUE,proximity=TRUE)
pmml(iris.rf)
Can you provide some more information, such as which function you are trying to use?
For example, if you are using the randomForest package, I believe it doesn't provide information about the score distribution, so neither can the PMML representation. However, if you are using the default values, the parameter 'nodesize' for classification cases equals 1, which means a terminal node will have a ScoreDistribution such as:
<ScoreDistribution value="predictedValue" probability="1.0"/>
<ScoreDistribution value="AnyOtherTargetCategory" probability="0.0"/>
If you are using the rpart tree model, the pmml function does output the score distribution information. Perhaps you can give us the exact commands you used?
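For comparison, a sketch of the rpart route (the object and file names are mine); rpart keeps per-node class counts, so pmml() can emit ScoreDistribution elements:

```r
library(rpart)
library(pmml)
library(XML)

fit <- rpart(Species ~ ., data = iris)
p <- pmml(fit)
saveXML(p, file = "iris_rpart.pmml")
# iris_rpart.pmml now contains <ScoreDistribution .../> entries
# under each Node of the TreeModel.
```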
I'm using randomForest in order to find out the most significant variables. I was expecting some output that defines the accuracy of the model and also ranks the variables based on their importance. But I am a bit confused now. I tried randomForest and then ran importance() to extract the importance of variables.
But then I saw another command, rfcv (Random Forest Cross-Validation for feature selection), which I suppose should be the most appropriate for this purpose. The questions I have are: how do I get the list of the most important variables? How do I see the output after running it? Which command should I use?
Another thing: What is the difference between randomForest and predict.randomForest?
I am not very familiar with randomForest and R, so any help would be appreciated.
Thank you in advance!
After you have built a randomForest model, you use predict.randomForest to apply the model to new data: e.g. build a random forest with training data, then run your validation data through that model with predict.randomForest.
As for rfcv, there is an option recursive which (from the help):
whether variable importance is (re-)assessed at each step of variable
reduction
It's all in the help file.
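A sketch of both steps on the iris data (the object names are mine): importance() gives you the ranked variable list, while rfcv() only reports the cross-validated error as predictors are dropped; it does not itself return a ranking:

```r
library(randomForest)

set.seed(1)
rf <- randomForest(Species ~ ., data = iris, importance = TRUE)

# Ranked variable list
imp <- importance(rf)
imp[order(imp[, "MeanDecreaseGini"], decreasing = TRUE), ]

# Cross-validated error rate vs. number of variables retained
cv <- rfcv(trainx = iris[, 1:4], trainy = iris$Species, cv.fold = 5)
cv$error.cv
```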