Decision Tree algorithms in R packages

Is there any way to specify the algorithm used in any of the R packages for decision tree formation? I know that CART and C5.0 models are available. I want to find out about other decision tree algorithms such as ID3, C4.5 and OneRule algorithms.
EDIT: Since my question was ambiguous, I would like to clarify it. Is there some function (say fun()) that creates and trains a decision tree, where we can specify the algorithm as a parameter of fun()?
For example, to find the correlation between two vectors we have cor(), where we can specify the method as pearson, spearman, or kendall.
Is there such a function for decision trees as well, so that we can use different algorithms like ID3, C4.5, etc.?
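As a rough illustration of the interface being asked about, caret's train() comes close: its method argument selects the underlying tree learner, much as cor() selects a correlation method. This is a sketch only; it assumes the backing packages (rpart, C50, RWeka) are installed and uses iris purely as a stand-in dataset.

# Sketch: selecting the tree algorithm via caret's `method` argument,
# analogous to selecting the correlation method in cor().
library(caret)

cor(iris$Sepal.Length, iris$Petal.Length, method = "spearman")  # the cor() analogy

fit_cart <- train(Species ~ ., data = iris, method = "rpart")   # CART via rpart
fit_c50  <- train(Species ~ ., data = iris, method = "C5.0")    # C5.0 via the C50 package
fit_c45  <- train(Species ~ ., data = iris, method = "J48")     # C4.5 (J48) via RWeka
fit_oner <- train(Species ~ ., data = iris, method = "OneR")    # OneR via RWeka

fit_c45$results   # resampled accuracy for the tuning parameters tried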

Related

Choosing the proper optimisation algorithm in R

I am trying to find the extremum of a linear objective function with linear equality, linear inequality, and nonlinear (quadratic) inequality constraints. The problem is that I have already tried many algorithms from packages such as nloptr, Rsolnp, and NlcOptim, and each time I obtained different results. What is more, the results often differ from those of the GRG algorithm in Excel, which finds better values of the objective function being minimised.
So far solnp (Rsolnp package) gives good results, and after proper calibration they are even better than those from Excel's GRG algorithm. Results from solnl (NlcOptim) are average and change considerably even when the input data is only slightly modified.
The nloptr function (nloptr package) implements a wide range of algorithms. I tried a few (I do not remember exactly which) and the results were still average and completely different from those obtained with the other algorithms.
My knowledge of optimisation algorithms is quite poor, and my attempts so far amount to picking algorithms more or less at random. Could you advise some algorithms implemented in R that can handle such a problem? And which one is better than another, and why? Perhaps there is some framework or decision tree for choosing the proper optimisation algorithm.
If it helps: I am trying to find the optimal weights of portfolio assets, where the objective is to minimise portfolio risk (standard deviation), subject to the constraints that all asset weights sum to 1 and are greater than or equal to 0, and that the portfolio return equals a defined target.
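For what it is worth, here is a minimal sketch of that portfolio problem with Rsolnp::solnp(); the covariance matrix, expected returns, and target return below are toy placeholders rather than data from the question.

# Sketch: minimum-variance portfolio with a target return, via Rsolnp::solnp().
library(Rsolnp)

n      <- 4
set.seed(1)
A      <- matrix(rnorm(n * n), n)
Sigma  <- crossprod(A) / n           # toy positive-definite covariance matrix
mu     <- c(0.04, 0.05, 0.07, 0.09)  # toy expected returns
target <- 0.06                       # required portfolio return

obj <- function(w) sqrt(drop(t(w) %*% Sigma %*% w))  # portfolio standard deviation

# Equality constraints: weights sum to 1 and the portfolio return hits the target
eq  <- function(w) c(sum(w), sum(w * mu))

fit <- solnp(
  pars  = rep(1 / n, n),   # start from equal weights
  fun   = obj,
  eqfun = eq,
  eqB   = c(1, target),
  LB    = rep(0, n),       # long-only: weights >= 0
  UB    = rep(1, n)
)

round(fit$pars, 4)          # optimal weights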

R: Clustering validation methods (mixed data)

I have clustered a mixed dataset containing numerical and categorical features (the heart dataset from UCI) using two clustering methods, k-prototypes and PAM.
My question is: how do I validate the results of the clustering?
I have found different methods in R, such as the Rand index, SSE, purity, clValid, and pvclust, but all of them work with numeric data.
Is there any method that can be used in the case of mixed data?
Yes, you can compare the clustering results with the CV index. For more details you can read up on the CV index.
The CV formula combines CU (category utility) for the categorical attributes and the variance for the numeric attributes.
You can still use the Adjusted Rand Index. This index only compares two partitions; it does not matter whether the partitions were built from categorical or continuous features.
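For example, a minimal sketch with mclust::adjustedRandIndex(); the two label vectors below are hypothetical stand-ins for the k-prototypes and PAM assignments.

# Sketch: comparing two partitions with the Adjusted Rand Index.
# Any two label vectors of the same length work, regardless of the feature
# types used to produce them (e.g. clustMixType::kproto() and cluster::pam()).
library(mclust)

labels_kproto <- c(1, 1, 2, 2, 3, 3, 1, 2)
labels_pam    <- c(2, 2, 1, 1, 3, 3, 2, 1)

adjustedRandIndex(labels_kproto, labels_pam)  # 1 here: same grouping, different labels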
How many observations (n) and dimensions (d) are you working with?
You are probably in the n >> d case, but more recently d >> n has become a hot topic.
Variable selection is something that needs to be done beforehand. Check for feature correlation, as this can affect the number of clusters that you detect. If two features are correlated and the relationship happens to be linear, you can use the gradient instead of the two variables.
There is no absolute answer to your question; many methods exist precisely because of this. Clustering is exploratory by nature. The better you know your data, the better you can design tests.
You need to define what you want to test: the stability of the partition, or the stability of the clustering recipe. There are different ways to deal with each of these problems. For the first, resampling is key (a sketch follows the reading list below); for the second, comparison indexes that measure how many observations are left out of a given partition are often used.
Recommended reading:
[1] Meila, M. (2016). Criteria for Comparing Clusterings. In C. Hennig, M. Meila, F. Murtagh, and R. Rocci (Eds.), Handbook of Cluster Analysis, pp. 619-635.
[2] Leisch, F. (2016). Resampling Methods for Exploring Cluster Stability. In C. Hennig, M. Meila, F. Murtagh, and R. Rocci (Eds.), Handbook of Cluster Analysis, pp. 637-652.
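As a concrete illustration of the resampling idea in [2], here is a rough sketch with fpc::clusterboot(); k-means on simulated numeric data serves only as a placeholder clustering method.

# Sketch: bootstrap stability of a partition with fpc::clusterboot().
# The per-cluster mean Jaccard similarities (bootmean) indicate how stable
# each cluster is under resampling.
library(fpc)

set.seed(123)
toy <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
             matrix(rnorm(100, mean = 4), ncol = 2))

cb <- clusterboot(toy, B = 50, clustermethod = kmeansCBI, krange = 2, seed = 123)
cb$bootmean   # clusterwise mean Jaccard similarity across bootstrap samples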

Hierarchical Classification (on R)

First of all, sorry about my English; I am Brazilian and still improving it.
I have a hierarchical dataset which I have so far used to build flat classification models (NaiveBayes, JRip, J48, SVM)...
For example:
> model <- svm(family ~ ., data = train)
> pred <- predict(model, test[, -ncol(test)])
And then I calculated Precision, Recall and F-measure, ignoring the fact that the dataset is organized hierarchically.
However, now I want to explore the fact that it is hierarchical and obtain different models and results. So what should I do? Considering the same ML algorithms (NaiveBayes, JRip, J48, SVM), how do I create the models? Should I change or include new parameters? Or should I continue as shown in the code before, and just use hierarchical Precision, hierarchical Recall and hierarchical F-measure as evaluation metrics? If so, is there any specific package?
Thanks!
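For the evaluation side of the question, here is a rough sketch of hierarchical precision, recall, and F-measure computed over ancestor-augmented label sets; the helper function and the toy taxonomy below are illustrative, not taken from a specific package.

# Sketch: hierarchical precision/recall/F-measure, assuming each example's
# true and predicted classes have already been expanded to include all of
# their ancestors in the hierarchy. The ancestor expansion itself depends
# on your class taxonomy and is not shown here.
hier_metrics <- function(true_sets, pred_sets) {
  # true_sets / pred_sets: lists of character vectors, one element per example,
  # each containing the class plus all of its ancestors
  inter <- mapply(function(t, p) length(intersect(t, p)), true_sets, pred_sets)
  hP <- sum(inter) / sum(lengths(pred_sets))
  hR <- sum(inter) / sum(lengths(true_sets))
  hF <- 2 * hP * hR / (hP + hR)
  c(hPrecision = hP, hRecall = hR, hF = hF)
}

# Toy example: true path Insecta -> Diptera -> Culicidae, predicted path
# Insecta -> Diptera -> Drosophilidae
true_sets <- list(c("Insecta", "Diptera", "Culicidae"))
pred_sets <- list(c("Insecta", "Diptera", "Drosophilidae"))
hier_metrics(true_sets, pred_sets)   # hP = hR = 2/3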

What is the interpretation of the plot boxes of a Logistic Model Tree (LMT) in the RWeka package in R?

I'm working on a user classification with 5 known groups (observations approximately equally divided over the groups). I have information about these users (such as age and living area) and am trying to find the characteristics that identify the users in each group.
For this purpose I use the RWeka package in R (a collection of machine learning algorithms: http://cran.r-project.org/web/packages/RWeka/RWeka.pdf). To find the characteristics that distinguish between my groups I use Logistic Model Trees (LMT). There is very little documentation for this function.
I will try to sketch an example of a plotted tree.
The splits are straightforward to interpret, but each terminal node contains a box filled with:
LM_24: 48/96
(20742)
What does this mean? How can I see which of the five groups the node ends up in?
With what function can I retrieve the coefficients used in the model, so that the influence of the variables can be studied?
(I did look into other methods for building trees on these data, but the regression and classification tree packages (like rpart and party) only find one terminal node in my data, whilst the LMT function finds 6 split nodes.)
I hope you can provide an answer or some help with this function. Thanks a lot!
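For reference, a minimal sketch of fitting an LMT with RWeka on a stand-in dataset (iris); printing the fitted classifier shows Weka's text output, which includes the per-leaf logistic models (the LM_k blocks) and their coefficients.

# Sketch: fitting a Logistic Model Tree with RWeka and inspecting it.
library(RWeka)

fit <- LMT(Species ~ ., data = iris)
print(fit)      # tree structure plus the logistic model at each leaf
summary(fit)    # resubstitution accuracy and confusion matrix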

Splitting rules in mvpart vs rpart

I would like to make classification trees to predict the presence/absence of 1 bird species based on several variables. I know that rpart handles univariate partitioning and mvpart handles multivariate partitioning, but I'd like to use mvpart for my one-variable tree because of its more flexible output. Does anyone know of a reason that I should not do this? Will the splits be different in rpart vs mvpart with the same exact input?
It cannot be guaranteed that the splits will be the same; mvpart() minimises the within-group sums of squares, whereas rpart for a classification tree minimises the Gini index (by default, IIRC).
You may end up with the same model/splits, but as the two functions use different measures of node impurity, that would just be a coincidence.
Also note that mvpart fits a regression model, whereas you want a classification model.
Finally, consider using the party package and its function ctree; it has much nicer outputs than rpart by default but is, again, doing something slightly different in terms of model fitting.
As an aside, also look into the plotmo package which includes enhanced plots for a number of tree-like models including, IIRC, rpart ones.
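For comparison, a minimal sketch of the two suggested classification approaches on a hypothetical presence/absence data frame; the birds data below is simulated purely for illustration.

# Sketch: classification tree (rpart) vs. conditional inference tree (party::ctree)
# on a simulated presence/absence dataset.
library(rpart)
library(party)

set.seed(42)
birds <- data.frame(
  presence  = factor(rbinom(200, 1, 0.4)),   # 0/1 presence of the species
  elevation = runif(200, 0, 2000),
  canopy    = runif(200, 0, 100)
)

# Classification tree: splits chosen by the Gini index by default
fit_rpart <- rpart(presence ~ ., data = birds, method = "class")

# Conditional inference tree: splits chosen via permutation tests
fit_ctree <- ctree(presence ~ ., data = birds)

plot(fit_ctree)   # party's plot is often easier to read than rpart's default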
