Minbucket and weights in rpart - r

A couple questions for the rpart and party experts.
1) I am trying to understand the difference of the control parameter "minbucket" in rpart and party. Is it correct that minbucket in rpart is unweighted (even if weights are provided to fit the tree)?
2) Can anyone briefly describe how the weights are used in the rpart algorithm? I tried to download and review the source code, but I couldn't make much sense of it being a newbie. rpart calls a C function (C_rpart), which seems to be the main part of rpart, but I couldn't find more information about it.
Thanks so much in advance.

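The weights argument in rpart (as in most other machine learning algorithms) can be thought of as duplicating each training row that many times: a weight of 5 acts like having that row repeated 5 times. One caveat for the minbucket question: the rpart documentation describes minbucket as the minimum number of observations in a terminal node, whereas party's ctree_control describes it as the minimum sum of weights, so the duplication analogy may not carry over to minbucket in rpart. If your weights are positive integers and your data set is small enough, you can build the expanded data set explicitly: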
data[rep(seq_len(nrow(data)), times = data$weights), ]
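As a quick sanity check of the duplication idea, here is a minimal sketch on a made-up data frame (the column names and values are illustrative, and the weights are assumed to be positive integers). It only shows that a weighted mean on the original rows matches a plain mean on the expanded rows; it does not by itself say anything about how rpart treats minbucket.

```r
# Toy data with integer case weights (names and values are illustrative).
data <- data.frame(
  x = c(1, 2, 3, 4),
  y = c(10, 20, 30, 40),
  weights = c(1, 5, 2, 1)
)

# Repeat row i exactly weights[i] times.
expanded <- data[rep(seq_len(nrow(data)), times = data$weights), ]

# The expanded data has one row per unit of weight.
nrow(expanded) == sum(data$weights)

# A weighted mean on the original rows and a plain mean on the
# expanded rows describe the same quantity.
weighted.mean(data$y, data$weights)
mean(expanded$y)
```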

Related

R: Evaluate Gradient Boosting Machines (GBM) for Regression

Which are the best metrics to evaluate the fit of a GBM algorithm in R (metrics, graphs, ratios)? And how should they be interpreted?
I think maybe you are overthinking this one! Take a step back and think about what matters: the error. You have forecasted values and you have observed values; the difference tells you most of what you need to know when comparing across models. Basic measures like MSE, MPE, etc. should do fine. If you are looking to refine within a given model, I would recommend taking a look at the gbm documentation. For example, you can pass your gbm model object to summary() to get the relative influence of each of your variables. The documentation has a lot more detail, so if you haven't looked at it yet, I would recommend doing so. I have posted the link at the bottom.
-Carmine
gbm_documentation
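The error measures mentioned above can be computed in a few lines of base R. This is only a sketch: the observed and predicted vectors are made-up numbers, and in practice the predictions would come from calling predict() on held-out data.

```r
# Hypothetical observed and predicted values (illustrative numbers only).
observed  <- c(3.0, 5.0, 2.5, 7.0)
predicted <- c(2.5, 5.0, 4.0, 8.0)

errors <- observed - predicted

mse  <- mean(errors^2)                 # mean squared error
rmse <- sqrt(mse)                      # same units as the response
mae  <- mean(abs(errors))              # mean absolute error
mpe  <- 100 * mean(errors / observed)  # mean percentage error (signed)

c(MSE = mse, RMSE = rmse, MAE = mae, MPE = mpe)
```

Lower MSE, RMSE and MAE mean a closer fit, while the sign of MPE indicates whether the model tends to under-predict or over-predict on average.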

Difference between "mlp" and "mlpML"

I'm using the caret package in R to create prediction models for maximum energy demand. I need to use a multilayer perceptron neural network, but I found that caret has two such methods, "mlp" and "mlpML". What is the difference between the two?
I have read the description in a book (Advanced R Statistical Programming and Data Models: Analysis, Machine Learning, and Visualization), but it still doesn't answer my question.
Caret has 238 different models available! However, many of them are just different ways of calling the same basic algorithm.
Besides mlp there are 9 other methods for calling a multi-layer perceptron, one of which is mlpML. The real difference is only in the parameters of the function call; which one you need depends on your use case and on what you want to adapt in the basic model.
Chances are, if you don't know what mlpML, mlpWeightDecay, etc. do, you are fine just using the basic mlp.
Looking at the official documentation, we can see that:
mlp has the tuning parameter size, while mlpML has layer1, layer2 and layer3. So with the first method you can only tune the size of the single hidden layer, while with the second you can tune each hidden layer individually.
Looking at the source code here:
https://github.com/topepo/caret/blob/master/models/files/mlp.R
and here:
https://github.com/topepo/caret/blob/master/models/files/mlpML.R
It seems that the difference is that mlpML allows several hidden layers:
modelInfo <- list(label = "Multi-Layer Perceptron, with multiple layers",
while mlp has one single layer with hidden units.
The official documentation also hints at this difference. In my opinion, it is not particularly useful to have many different models that differ only very slightly, and the documentation does not explain those slight differences well.
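To make the difference in tuning parameters concrete, here is a small sketch of the two tuning grids. It uses only base R; the candidate values are arbitrary, and the train() call in the comment assumes a hypothetical data frame df with response y.

```r
# "mlp": one tuning parameter, the number of units in the hidden layer.
mlp_grid <- expand.grid(size = c(3, 5, 7))

# "mlpML": one tuning parameter per hidden layer.
mlpml_grid <- expand.grid(layer1 = c(3, 5), layer2 = c(2, 4), layer3 = 2)

mlp_grid
mlpml_grid

# Either grid could then be passed to caret, for example (not run here,
# df and y are hypothetical):
# fit <- caret::train(y ~ ., data = df, method = "mlpML", tuneGrid = mlpml_grid)
```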

Implement random forest without bootstrap

I want to implement the random forest algorithm of Breiman (2001) using all my training set to grow the trees. In other words, I want to keep the random selection of inputs at each node and remove the bootstrap stage. This is motivated by the fact that I'm working with few observations that exhibit auto-correlation.
I've gone through the documentation of the packages randomForest, ranger and Rborist, but I didn't find an answer. I've also tried to take a look at the source code of the function randomForest using getAnywhere(randomForest.default); but I have to admit that my R-level is too low to be able to get anything out of it.
Thank you in advance.
Edit. Note to future readers: if you want to modify the bootstrap step, make sure to set keep.inbag=TRUE when using randomForest, so that you can check which observations each tree actually used.
The sampsize argument in randomForest controls the number of samples used for each tree and the replace argument controls whether or not you are bootstrapping. So in your case, set sampsize=N (number of samples) and replace=FALSE.
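A minimal sketch of that setting on simulated data (the data, ntree and mtry values are made up for illustration). With keep.inbag = TRUE, the inbag matrix lets you confirm that every observation was used by every tree.

```r
library(randomForest)

set.seed(1)
n <- 50
x <- data.frame(a = rnorm(n), b = rnorm(n), c = rnorm(n))  # simulated inputs
y <- x$a + rnorm(n)                                        # simulated response

# Use all n observations for every tree, sampled without replacement,
# while still drawing mtry candidate variables at random at each node.
rf <- randomForest(x, y, ntree = 100, mtry = 1,
                   sampsize = n, replace = FALSE, keep.inbag = TRUE)

# TRUE if every observation is in-bag for every tree.
all(rf$inbag == 1)
```

Since no observation is ever left out of a tree, there are no out-of-bag samples, so out-of-bag error estimates are not available and the model has to be evaluated some other way, such as on a separate hold-out set.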

R Supervised Latent Dirichlet Allocation Package

I'm using this LDA package for R. Specifically, I am trying to do supervised latent Dirichlet allocation (sLDA). In the linked package, there's an slda.em function. What confuses me is that it asks for alpha, eta and variance parameters. As far as I understand, these parameters are unknowns in the model. So my question is: did the author of the package mean that these are initial guesses for the parameters? If so, there doesn't seem to be a way of accessing them from the result of running slda.em.
Aside from coding the extra EM steps in the algorithm, is there a suggested way to guess reasonable values for these parameters?
Since you are trying to generate a supervised model, the typical approach would be to use cross-validation to determine the model parameters. You hold out some of the data as your test set, train a model on the remaining data, and evaluate the model's performance on the held-out part, repeating k times. You then repeat this with different model parameters to determine which ones result in the best performance.
In the specific case of slda, I would run demo(slda) to see the author's implementation of it. When you run the demo, you'll see that he sets alpha=1.0, eta=0.1, and variance=0.25. I'd suggest using these as your starting point, and then use cross validation to determine better parameters if you need to improve model performance.
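The cross-validation loop described above can be sketched generically in base R. Everything here is an assumption for illustration: cv_grid_search and toy_fit_score are hypothetical helpers, and the toy model (a shrunken least-squares slope) merely stands in for a real model. For sLDA, the scoring function would instead fit slda.em on the training documents and score predictions on the held-out ones, with a grid over alpha, eta and variance.

```r
# Hypothetical helper: k-fold cross-validation over a parameter grid.
# fit_score(train, test, params) must return one numeric score (lower is better).
cv_grid_search <- function(n, grid, fit_score, k = 5) {
  folds <- sample(rep(seq_len(k), length.out = n))  # random fold labels
  grid$cv_score <- vapply(seq_len(nrow(grid)), function(i) {
    params <- grid[i, , drop = FALSE]
    mean(vapply(seq_len(k), function(f) {
      fit_score(train = which(folds != f), test = which(folds == f),
                params = params)
    }, numeric(1)))
  }, numeric(1))
  grid
}

# Toy stand-in model: least-squares slope through the origin,
# multiplied by a tunable shrinkage factor.
set.seed(42)
x <- rnorm(100)
y <- 2 * x + rnorm(100, sd = 0.5)

toy_fit_score <- function(train, test, params) {
  slope <- sum(x[train] * y[train]) / sum(x[train]^2)
  pred <- params$shrink * slope * x[test]
  mean((y[test] - pred)^2)  # held-out mean squared error
}

grid <- expand.grid(shrink = c(0.5, 0.75, 1))
results <- cv_grid_search(length(y), grid, toy_fit_score, k = 5)

# Parameter setting with the lowest cross-validated error.
results[which.min(results$cv_score), ]
```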

R - Party package: is cforest really bagging?

I'm using the "party" package to create random forest of regression trees.
I've created a ForestControl class in order to limit my number of trees (ntree), of nodes (maxdepth) and of variables I use to fit a tree (mtry).
One thing I'm not sure of is whether the cforest algorithm uses a subset of my training set for each tree it generates.
I've seen in the documentation that it does bagging, so I assume it should. But I'm not sure I understand what the "subset" input of that function is.
I'm also puzzled by the results I get using ctree: when plotting the tree, I see that all the observations of my training set are classified in the different terminal nodes, while I would have expected it to use only a subset here too.
So my question is: is cforest doing the same thing as ctree, or is it really bagging my training set?
Thanks in advance for your help!
Ben
