I would like to build classification trees to predict the presence/absence of a single bird species from several variables. I know that rpart handles univariate partitioning and mvpart handles multivariate partitioning, but I'd like to use mvpart for my one-variable tree because of its more flexible output. Does anyone know of a reason I should not do this? Will the splits differ between rpart and mvpart given exactly the same input?
It cannot be guaranteed that the splits will be the same; mvpart() minimises the within-group sums of squares, whereas rpart for a classification tree minimises the Gini coefficient (by default, IIRC).
You may end up with the same model/splits, but because the two functions use different measures of node impurity, any agreement may just be a fluke.
FYI, mvpart fits a regression model, but you want a classification model.
Finally, consider the party package and its function ctree(); it has much nicer output than rpart by default but is, again, doing something slightly different in terms of model fitting.
As an aside, also look into the plotmo package, which includes enhanced plots for a number of tree-like models, including, IIRC, rpart ones.
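To illustrate the ctree() suggestion, a minimal sketch for the one-predictor case; the data frame birds, its factor response present, and the predictor name are hypothetical:

library(party)

# ctree() infers a classification tree when the response is a factor
fit <- ctree(present ~ elevation, data = birds)

# party's default plot shows the splits plus per-node class
# distributions -- the nicer output mentioned above
plot(fit)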
My data doesn't contain any zeros. The minimum value of my outcome, y, is 1, and that is the value that is inflated. My objective is to fit a Poisson regression model in R that is both truncated (no zeros) and inflated (at one).
I already know how to fit each kind of regression separately, zero-truncated and zero-inflated. What I want to know is how to combine the two conditions into one model.
Thanks for your help.
For zero-inflated or zero-hurdle models, the standard approach is the pscl package. I also wrote a package that fits these kinds of models here, but it is not yet mature and fully tested. Unless you have voluminous data, I still recommend pscl, which is more flexible, robust, and better documented.
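A minimal sketch of a zero-inflated Poisson fit with pscl; the data frame df and the variable names are hypothetical:

library(pscl)

# count component before the bar, zero (selection) component after it
fit_zi <- zeroinfl(y ~ x1 + x2 | x1, data = df, dist = "poisson")
summary(fit_zi)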
For zero-truncated models, you can have a look at the VGAM::vglm function. You might find useful information here.
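And a corresponding sketch of a zero-truncated Poisson fit with VGAM, using the same hypothetical variable names:

library(VGAM)

# pospoisson() is the positive (zero-truncated) Poisson family
fit_zt <- vglm(y ~ x1 + x2, family = pospoisson(), data = df)
summary(fit_zt)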
Note that the two models make different distributional assumptions, so they are not interchangeable. Given the description of your dataset, I think you are looking for a zero-truncated model (since you do not observe zeros). With zero-inflated models, you decompose the observed pattern into zeros generated by a selection model and counts generated by a count-data model; that pattern does not look consistent with your dataset.
I want to run a linear regression model with a large number of variables, and I want an R function that iterates over good combinations of these variables and gives me the best combination.
The glmulti package will do this fairly effectively:
Automated model selection and model-averaging. Provides a wrapper for glm and other functions, automatically generating all possible models (under constraints set by the user) with the specified response and explanatory variables, and finding the best models in terms of some Information Criterion (AIC, AICc or BIC). Can handle very large numbers of candidate models. Features a Genetic Algorithm to find the best models when an exhaustive screening of the candidates is not feasible.
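A minimal sketch of what a call might look like; the data frame df, response y, and predictors are hypothetical, and the arguments shown are the commonly used ones:

library(glmulti)

res <- glmulti(y ~ x1 + x2 + x3 + x4 + x5, data = df,
               level = 1,      # main effects only, no interactions
               crit = "aicc",  # small-sample corrected AIC
               method = "h")   # exhaustive screening ("g" = genetic algorithm)
print(res)                     # reports the best model(s) found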
Unsolicited advice follows:
HOWEVER. Please be aware that while this approach can find the model that minimizes within-sample error (the goodness of fit on your actual data), it has two major problems that should make you think twice about using it:
1. This type of data-driven model selection will almost always destroy your ability to make reliable inferences (compute p-values, confidence intervals, etc.). See this CrossValidated question.
2. It may overfit your data (although using the information criteria listed in the package description will help with this).
There are a number of different ways to characterize a "best" model, but AIC is a common one; base R offers step(), and the MASS package offers stepAIC().
# fit the full model on the built-in swiss dataset
summary(lm1 <- lm(Fertility ~ ., data = swiss))
# stepwise selection minimizing AIC, starting from the full model
slm1 <- step(lm1)
summary(slm1)
slm1$anova  # the sequence of steps that were taken
I have built an SVM-RBF model in R using caret. Is there a way of plotting the decision boundary?
I know it is possible to do so using other R packages, but unfortunately I'm forced to use caret because it is the only package I have found that computes variable importance.
Alternatively, can you suggest a package that plots decision boundaries AND also gives variable importance?
Thank you very much
First of all, unlike some other methods, SVM does not itself produce a feature importance. In your case, the importance score caret reports is calculated independently of the method itself: https://topepo.github.io/caret/variable-importance.html#model-independent-metrics
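For reference, a minimal sketch of pulling that model-independent importance from a caret model; fit stands in for the train() object from the question:

library(caret)

# useModel = FALSE requests the filter-based importance, which is
# what caret falls back to for SVMs anyway
imp <- varImp(fit, useModel = FALSE)
plot(imp)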
Second, the decision boundary (or hyperplane) you see in most textbook examples comes from a toy problem with only two or three features. If you have more than three features, visualizing this hyperplane is not trivial.
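For the two-feature case, one common workaround is to predict over a dense grid and draw the predicted classes as a background. A minimal sketch using the built-in iris data and two of its features (everything here is illustrative, not the asker's data):

library(caret)

d <- iris[, c("Sepal.Length", "Sepal.Width", "Species")]
fit <- train(Species ~ ., data = d, method = "svmRadial")

# predict over a grid spanning the two features
grid <- expand.grid(
  Sepal.Length = seq(min(d$Sepal.Length), max(d$Sepal.Length), length.out = 200),
  Sepal.Width  = seq(min(d$Sepal.Width),  max(d$Sepal.Width),  length.out = 200)
)
grid$pred <- predict(fit, newdata = grid)

# the colour changes in the background trace the decision boundary
plot(grid$Sepal.Length, grid$Sepal.Width,
     col = as.integer(grid$pred), pch = 15, cex = 0.3,
     xlab = "Sepal.Length", ylab = "Sepal.Width")
points(d$Sepal.Length, d$Sepal.Width, col = as.integer(d$Species), pch = 19)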
I'm using the "party" package to create random forest of regression trees.
I've created a ForestControl object in order to limit the number of trees (ntree), the depth of the nodes (maxdepth), and the number of variables used to fit each tree (mtry).
One thing I'm not sure of is whether the cforest algorithm uses subsets of my training set for each tree it generates.
I've seen in the documentation that it does bagging, so I assume it should, but I'm not sure I understand what the "subset" input to that function is.
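For concreteness, a minimal sketch of the kind of setup described; the parameter values are hypothetical:

library(party)

# cforest_unbiased() builds the ForestControl object; in
# cforest_control(), the replace and fraction arguments govern how
# each tree's sample of the training set is drawn
ctrl <- cforest_unbiased(ntree = 50, mtry = 3)
fit <- cforest(Species ~ ., data = iris, controls = ctrl)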
I'm also puzzled by the results I get using ctree: when plotting the tree, I see that all the observations of my training set end up classified in the terminal nodes, whereas I would have expected it to use only a subset there too.
So my question is: is cforest doing the same thing as ctree, or is it really bagging my training set?
Thanks in advance for your help!
Ben
Is there any way to specify the algorithm used in any of the R packages for decision tree formation? I know that CART and C5.0 models are available. I want to find out about other decision tree algorithms, such as ID3, C4.5, and OneRule.
EDIT: Since my question was ambiguous, let me clarify it. Is there some function (say, fun()) that creates and trains a decision tree and lets us specify the algorithm as a parameter of fun()?
For example, to find the correlation between two vectors we have cor(), where we can specify the method used as pearson, spearman, or kendall.
Is there such a function for decision trees as well, so we can use different algorithms like ID3, C4.5, etc.?
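The closest analogue I know of to the cor() pattern is caret::train(), whose method argument selects the underlying tree algorithm; a sketch assuming caret and the relevant backends (RWeka, C50) are installed:

library(caret)

fit_cart <- train(Species ~ ., data = iris, method = "rpart")  # CART
fit_c50  <- train(Species ~ ., data = iris, method = "C5.0")   # C5.0
fit_c45  <- train(Species ~ ., data = iris, method = "J48")    # C4.5 via RWeka
fit_oner <- train(Species ~ ., data = iris, method = "OneR")   # OneRule via RWeka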