identifying key columns/features used by decision tree regression - r

In Azure ML, I have a predictive regression model using boosted decision tree regression and it is reasonably accurate.
The input dataset has over 450 columns and the model has done a good job of predicting against test data sets, without over-fitting.
To report on the result i need to know what features/columns the model mainly used to make predictions but i cant find this information easily when looking at the trained model data.
How do i identify this information? Im happy to import the result dataset into R to help find this but I just need pointers on what direction to start working in.

Mostly, in using Microsoft Azure Machine Learning, when looking at the features that is mainly used to make predictions, it is found on the output of the Train Model module.
But on using Decision Trees as your algorithm, the output of your Train Model module would be the constructed 'trees' of the algorithm, and it looks like this:
To know the features that made impact on predictions while using Decision Trees algorithms, you can use the Permutation Feature Importance module. Look at the sample experiment below:
The parameters of Permutation Feature Importance are Random Seed and Metric for Measuring Performance (in this case, Regression - Coefficient of Determination)
The left input of Permutation Feature Importance is your trained model, and the right input is your test data.
The output of Permutation Feature Importance looks like this:
You can add Execute R Script to extract the Features and Scores from Permutation Feature Importance module.

Related

How to do cross-validation in R using neuralnet?

I'm trying to build a predictive model, using the neuralnet package. First I'm spliting my dataset in training (80%) and test (20%). But ANN is such a powerful technique that my model easily overfits the training set and performs poorly on the external test set.
Predicted vs True Value - Training is the right one and test set is the left one
Is there a way to do a cross-validation on the training set so that my model doesn't overfit the set? How may I do this with my own built in function?
Plus, are there any other approaches when dealing with deep learning? I've heard you can tweak the weights of the model in order to improve its quality on external data.
Thanks in advance!

Assessing LDA predictions with textmineR in R - Calculating perplexity?

I am working on a LDA model with textmineR, have calculated coherence, log-likelihood measures and optimized my model.
As a last step I would like to see how well the model predicts topics on unseen data. Thus, I am using the predict() function from the textminer package in combination with GIBBS sampling on my testset-sample.
This results in predicted "Theta" values for each document in my testset-sample.
While I have read in another post that perplexity-calculations are not available with the texminer package (See this post here: How do i measure perplexity scores on a LDA model made with the textmineR package in R?), I am now wondering what the purpose of the prediction function is then for? Especially with a large dataset of over 100.000 Documents it is hard to just visually assess whether the prediction has performed well or not.
I do not want to use perplexity for model selection (I am using coherence/log-likelihood instead), but as far as I understand, perplexity would help me to understand how well the prediction is and how "surprised" the model is with new, previously unseen data.
Since this does not seem to be available for textmineR, I am not sure how to assess the model prediction. Is there anything else that I could use to measure the prediction quality of my textminer model?
Thank you!

How to train a multiple linear regression model to find the best combination of variables?

I want to run a linear regression model with a large number of variables and I want an R function to iterate on good combinations of these variables and give me the best combination.
The glmulti package will do this fairly effectively:
Automated model selection and model-averaging. Provides a wrapper for glm and other functions, automatically generating all possible models (under constraints set by the user) with the specified response and explanatory variables, and finding the best models in terms of some Information Criterion (AIC, AICc or BIC). Can handle very large numbers of candidate models. Features a Genetic Algorithm to find the best models when an exhaustive screening of the candidates is not feasible.
Unsolicited advice follows:
HOWEVER. Please be aware that while this approach can find the model that minimizes within-sample error (the goodness of fit on your actual data), it has two major problems that should make you think twice about using it.
this type of data-driven model selection will almost always destroy your ability to make reliable inferences (compute p-values, confidence intervals, etc.). See this CrossValidated question.
it may overfit your data (although using the information criteria listed in the package description will help with this)
There are a number of different ways to characterize a "best" model, but AIC is a common one, and base R offers step(), and package MASS offers stepAIC().
summary(lm1 <- lm(Fertility ~ ., data = swiss))
slm1 <- step(lm1)
summary(slm1)
slm1$anova

Machine Learning Keras accuracy model vs accuracy new data prediction

I did a deep learning model using keras. Model accuracy has 99% score.
$`loss`
[1] 0.03411416
$acc
[1] 0.9952607
When I do a prediction classes on my new data file using the model I have only 87% of classes well classified. My question is, why there is a difference between model accuracy and model prediction score?
Your 99% is on the Training Set, this is an indicator of own is performing your algorithm while training, you should never look at it as a reference.
You should always look at your Test Set, this is the real value that matters.
Fore more, your accuracies should always look like this (at least the style):
e.g. The training set accuracy always growing and the testing set following the same trend but below the training curve.
You will always never have the exact two same sets (training & testing/validating) so this is normal to have a difference.
The objective of the training set is to generalize your data and learn from them.
The objective of the testing set is to see if you generalized well.
If you're too far from your training set, either there a lot of difference between the two sets (mostly distribution, data types etc..), or if they are similar then your model overfits (which means your model is too close to your training data and if there is a little difference in your testing data, this will lead to wrong predictions).
The reason the model overfits is often that your model is too complicated and you must simplify it (e.g. reduce number of layers, reduce number of neurons.. etc)

can we get probabilities the same way that we get them in logistic regression through random forest?

I have a data structure with binary 0-1 variable (click & Purchase; click & not-purchase) against a vector of the attributes. I used logistic regression to get the probabilities of the purchase. How can I use Random Forest to get the same probabilities? Is it by using Random Forest regression? or is it Random Forest classification with type='prob' in R which gives the probability of categorical variable?
It won't give you the same result since the structure of the two method are different. Logistic regression is given by a definitive linear specification, where RF is a collective vote from multiple independent/random trees. If specification and input feature are properly tuned for both, they can produce comparable results. Here is the major difference between the two:
RF will give more robust fit against noise, outliers, overfitting or multicollinearity etc which are common pitfalls in regression type of solution. Basically if you don't know or don't want to know much about whats going in with the input data, RF is a good start.
logistic regression will be good if you know expertly about the data and how to properly specify the equation. Or somehow want to engineer how the fit/prediction works. The explicit form of GLM specification will allow you to do that.

Resources