First of all, sorry about my English; I am Brazilian and still improving it.
I have a hierarchical dataset which I have been using to build flat classification models (NaiveBayes, JRip, J48, SVM)...
For example:
> library(e1071)                                 # provides svm()
> model <- svm(family ~ ., data = train)
> pred  <- predict(model, test[, -ncol(test)])   # drop the class column from the test set
And then I calculated Precision, Recall and F-measure, ignoring the fact that the dataset is organized hierarchically.
However, now I want to exploit the fact that it is hierarchical and obtain different models and results. So what should I do? Considering the same ML algorithms (NaiveBayes, JRip, J48, SVM), how do I create the models? Should I change or include new parameters? Or should I keep the code above as it is and just use hierarchical Precision, hierarchical Recall and hierarchical F-measure as evaluation metrics? If so, is there any specific package?
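For reference, the hierarchical versions of these metrics are often defined by augmenting each true and predicted label with all of its ancestor classes and then computing micro-averaged precision and recall over those sets. A minimal sketch of that computation, assuming a helper ancestors() that returns a class together with all of its ancestors (the helper and the label vectors are illustrative, not from a specific package):

    # Hierarchical precision, recall and F-measure (micro-averaged).
    # `ancestors(cls)` is assumed to return cls plus all of its ancestor classes.
    hier_metrics <- function(true, pred, ancestors) {
      tp <- fp <- fn <- 0
      for (i in seq_along(true)) {
        T_i <- ancestors(true[i])                 # true class + ancestors
        P_i <- ancestors(pred[i])                 # predicted class + ancestors
        tp  <- tp + length(intersect(P_i, T_i))
        fp  <- fp + length(setdiff(P_i, T_i))
        fn  <- fn + length(setdiff(T_i, P_i))
      }
      hP <- tp / (tp + fp)
      hR <- tp / (tp + fn)
      c(hP = hP, hR = hR, hF = 2 * hP * hR / (hP + hR))
    }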
Thanks!
Related
In Azure ML, I have a predictive regression model using boosted decision tree regression and it is reasonably accurate.
The input dataset has over 450 columns and the model has done a good job of predicting against test data sets, without over-fitting.
To report on the result I need to know which features/columns the model mainly used to make predictions, but I can't find this information easily when looking at the trained model data.
How do I identify this information? I'm happy to import the result dataset into R to help find this, but I just need pointers on what direction to start working in.
Usually, in Microsoft Azure Machine Learning, the features that contribute most to predictions can be found in the output of the Train Model module.
But when you use a decision tree algorithm, the output of the Train Model module is the set of constructed 'trees' themselves.
To see which features had the most impact on predictions when using decision tree algorithms, you can use the Permutation Feature Importance module, wired into the experiment as described below.
The parameters of Permutation Feature Importance are Random Seed and Metric for Measuring Performance (in this case, Regression - Coefficient of Determination).
The left input of Permutation Feature Importance is your trained model, and the right input is your test data.
The output of Permutation Feature Importance is a list of the features ranked by their importance scores.
You can add an Execute R Script module to extract the features and scores from the Permutation Feature Importance module.
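For that last step, the Execute R Script body might look roughly like this (maml.mapInputPort/maml.mapOutputPort are the standard hooks of the Execute R Script module in Azure ML Studio, but the column names "Feature" and "Score" are assumptions about the Permutation Feature Importance output, so check them against your own results):

    # Sketch of an Execute R Script body in Azure ML Studio.
    dataset1 <- maml.mapInputPort(1)                  # output of Permutation Feature Importance
    ranked   <- dataset1[order(-dataset1$Score),      # sort by importance score
                         c("Feature", "Score")]       # column names are an assumption
    maml.mapOutputPort("ranked")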
I'm using the "party" package to create random forest of regression trees.
I've created a ForestControl object in order to limit the number of trees (ntree), the tree depth (maxdepth) and the number of variables used to fit each tree (mtry).
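For context, that setup would look roughly like this with the party API (the ntree/mtry values and the data/formula are illustrative; if I recall correctly, cforest_control() also exposes replace and fraction, which govern how each tree's sample is drawn from the training data, so ?cforest_control is the place to check):

    # Illustrative sketch with the party package; values are placeholders.
    library(party)
    fc  <- cforest_unbiased(ntree = 50, mtry = 3)      # returns a ForestControl object
    fit <- cforest(y ~ ., data = train, controls = fc)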
One thing I'm not sure about is whether the cforest algorithm uses a subset of my training set for each tree it generates or not.
I've seen in the documentation that it uses bagging, so I assume it should. But I'm not sure I understand what the "subset" argument of that function is for.
I'm also puzzled by the results I get using ctree: when plotting the tree, I see that all the variables of my training set appear in the different terminal nodes, whereas I would have expected it to use only a subset here too.
So my question is, is cforest doing the same thing as ctree or is it really bagging my training set?
Thanks in advance for your help!
Ben
Is there any way to specify the algorithm used in any of the R packages for decision tree construction? I know that CART and C5.0 models are available. I want to find out about other decision tree algorithms such as ID3, C4.5 and OneRule.
EDIT: Due to the ambiguous nature of my question, I would like to clarify it. Is there some function (say fun()) which creates and trains a decision tree wherein we can specify the algorithm as a parameter of the function fun()?
Like for example, to find the correlation between two vectors, we have cor() where we can specify the method used as pearson, spearman or kendall.
Is there such a function for decision trees as well so we can use different algorithms like ID3, C4.5, etc?
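As far as I know there is no single base function that works exactly like cor()'s method argument for trees, but there are close equivalents: the RWeka package exposes J48 (C4.5) and OneR (OneRule) as separate fitting functions, and the caret package offers a single train() interface where the algorithm is chosen via its method argument. A rough sketch using the built-in iris data (the package choices here are suggestions, not an exhaustive list):

    # RWeka: separate functions per algorithm
    library(RWeka)
    m_c45 <- J48(Species ~ ., data = iris)     # C4.5
    m_1r  <- OneR(Species ~ ., data = iris)    # OneRule

    # caret: one interface, algorithm selected by `method`
    library(caret)
    m_cart <- train(Species ~ ., data = iris, method = "rpart")   # CART
    m_j48  <- train(Species ~ ., data = iris, method = "J48")     # C4.5 via RWeka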
I would like to make classification trees to predict the presence/absence of 1 bird species based on several variables. I know that rpart handles univariate partitioning and mvpart handles multivariate partitioning, but I'd like to use mvpart for my one-variable tree because of its more flexible output. Does anyone know of a reason that I should not do this? Will the splits be different in rpart vs mvpart with the same exact input?
It cannot be guaranteed that the splits will be the same; mvpart() minimises the within-group sums of squares, whereas rpart for a classification tree minimises the Gini index (by default, IIRC).
You may end up with the same model/splits but as the two functions are using two different measures of node impurity this may just be a fluke.
FYI, mvpart is fitting a regression model but you want a classification model.
Finally, consider using the party package and its function ctree; it has much nicer outputs than rpart by default but is, again, doing something slightly different in terms of model fitting.
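As a concrete illustration of the difference (the data frame birds and the response presence are made-up names; presence should be a factor for a classification tree):

    # Illustrative only: `birds` / `presence` are placeholder names.
    library(rpart)
    library(party)
    fit_rpart <- rpart(presence ~ ., data = birds, method = "class")  # Gini-based CART splits
    fit_ctree <- ctree(presence ~ ., data = birds)                    # conditional inference splits
    plot(fit_ctree)                                                   # nicer default plot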
As an aside, also look into the plotmo package which includes enhanced plots for a number of tree-like models including, IIRC, rpart ones.
I am training a SVM classifier. Right now, I have about 4000 features, but a lot of them are redundant/uninformative. I want to reduce the features in the model to about maybe 20-50. I would like to use greedy hill climbing, reducing the features by 1 each time.
The removed feature should be the least important feature. After training an SVM, how do I get the ranking of the importance of the features? If I am using libsvm in R, how do I get the weight of each feature, or some other similar type of indicator of importance? Thanks!
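One common indicator, for a linear kernel and a two-class problem, is to reconstruct the weight vector of the separating hyperplane from the support vectors and their coefficients, then rank features by |w|. This is a sketch under those assumptions using e1071 (which wraps libsvm); it does not carry over to RBF or other non-linear kernels:

    # Assumes e1071::svm with kernel = "linear" and a two-class factor y;
    # x is the feature matrix with named columns.
    library(e1071)
    model <- svm(x, y, kernel = "linear", scale = TRUE)
    w     <- drop(t(model$coefs) %*% model$SV)     # weight vector of the hyperplane
    imp   <- sort(abs(w), decreasing = TRUE)       # feature importance ranking
    tail(names(imp), 1)                            # candidate feature to drop this round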
I would reduce the dimensionality of the problem first using PCA (Principal Component Analysis), then apply the SVM. See, e.g., Andrew Ng's lecture videos.
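A minimal sketch of that pipeline with prcomp and e1071 (the variable names and the number of components kept are illustrative):

    # PCA first, then SVM on the leading principal components.
    library(e1071)
    pca    <- prcomp(x_train, center = TRUE, scale. = TRUE)
    k      <- 50                                      # arbitrary; choose k from the explained variance
    fit    <- svm(pca$x[, 1:k], y_train)
    scores <- predict(pca, newdata = x_test)[, 1:k]   # project test data into the same PCA space
    pred   <- predict(fit, scores)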