How to reduce the run time in Azure ML for decision tree and decision forest - azure-machine-learning-studio

I am trying to run a regression model on a data set containing over 2,000,000 rows. I tried linear regression and boosted decision tree regression without Tune Model Hyperparameters, but I didn't get the expected accuracy. So I tried Tune Model Hyperparameters with the boosted decision tree, and the model runs for over 20 minutes. The decision forest also takes too long (even without Tune Model Hyperparameters). Is there any way to reduce the run time without compromising accuracy too much?
Will sampling affect the output (say, with a sampling rate of 0.5)?

The execution time in Azure ML Studio depends on the pricing tier. The free tier executes on one node at a time, while the standard tier can run multiple executions at once.
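On the sampling part of the question: one way to find out is to measure it on a holdout set rather than guess. The sketch below is only an illustration of that check, assuming synthetic data and a plain least-squares line as a stand-in for the tree models (the names `fit_line`, `rmse`, and the data itself are made up for this example). It trains once on all training rows and once on a 50% random sample, then compares the error of both models on the same holdout rows.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def rmse(model, xs, ys):
    """Root mean squared error of a fitted line on the given rows."""
    intercept, slope = model
    squared = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    return (squared / len(xs)) ** 0.5

rng = random.Random(0)

# Synthetic stand-in data (an assumption, not the asker's data set).
xs = [rng.uniform(0, 10) for _ in range(20000)]
ys = [3.0 + 2.0 * x + rng.gauss(0, 1) for x in xs]

# Keep the last 20% as a holdout that neither model sees during training.
split = int(0.8 * len(xs))
train_x, train_y = xs[:split], ys[:split]
test_x, test_y = xs[split:], ys[split:]

# Draw a 50% random sample of the training rows (sampling rate 0.5).
idx = rng.sample(range(len(train_x)), k=len(train_x) // 2)
half_x = [train_x[i] for i in idx]
half_y = [train_y[i] for i in idx]

rmse_full = rmse(fit_line(train_x, train_y), test_x, test_y)
rmse_half = rmse(fit_line(half_x, half_y), test_x, test_y)
print("holdout RMSE, all training rows:", rmse_full)
print("holdout RMSE, 50% sample:       ", rmse_half)
```

If the two holdout errors are close for your own model and data, the sample is probably large enough; if the sampled model is clearly worse, raise the sampling rate. How large the gap is depends on the data and the model, so this toy result should not be read as a prediction for the boosted tree or the decision forest.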

Related

Review performance of smaller model subsets of a large Random Forest model?

I'm constrained by the memory footprint / size of my Random Forest model so would prefer the number of trees be as low as possible and the depth of trees to be as shallow as possible while minimizing any impact on performance. Rather than needing to set up hyperparameter tuning to optimize for this, I am wondering whether I can just build one large Random Forest composed of many deep trees. From this, can I then get an estimate of the performance of hypothetical smaller models enclosed within (and save myself the time of hyperparameter tuning -- again I'm looking to just tune on those parameters that generally just need to be "big enough" for the data/problem)?
For example, if I build a model with 1500 trees, could I just extract 500 of these and build a prediction from these to give an estimate of the performance of using just 500 trees (if I do this repeatedly, each time evaluating performance on a holdout set, I figure this should give an estimate of the performance of building a model with 500 trees -- unless I'm missing something?) I should be able to do this similarly with max tree depth or minimum node size, correct?
How would I do this in R on a ranger model?
(Would appreciate any examples, with parsnip would be a bonus. Also guidance / verification that this is a reasonable approach to use to avoid hyperparameter tuning for Random Forest models for those hyperparameters that simply need to be "big"/"deep" enough would also be helpful.)
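The tree-count part of this idea can be sketched independently of the modelling library. The example below is in Python rather than R and is only a rough illustration: it assumes you already have a matrix of per-tree predictions on a holdout set (with ranger this should be obtainable via `predict(..., predict.all = TRUE)`, but check the documentation for your version). The function `subforest_rmse` and the synthetic "trees" are hypothetical names and data invented for the sketch.

```python
import random

def rmse(preds, targets):
    """Root mean squared error between two equal-length sequences."""
    return (sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(targets)) ** 0.5

def subforest_rmse(tree_preds, targets, k, repeats=20, seed=0):
    """Mean holdout RMSE over `repeats` random sub-forests of k trees.

    tree_preds[t][i] is the prediction of tree t for holdout row i.
    """
    rng = random.Random(seed)
    n_rows = len(targets)
    scores = []
    for _ in range(repeats):
        chosen = rng.sample(range(len(tree_preds)), k)
        # A regression forest predicts the average of its trees.
        ensemble = [sum(tree_preds[t][i] for t in chosen) / k for i in range(n_rows)]
        scores.append(rmse(ensemble, targets))
    return sum(scores) / repeats

# Synthetic stand-in: every "tree" predicts the truth plus independent noise.
rng = random.Random(1)
targets = [rng.uniform(0, 10) for _ in range(200)]
tree_preds = [[t + rng.gauss(0, 2) for t in targets] for _ in range(300)]

for k in (1, 10, 100, 300):
    print(k, "trees:", round(subforest_rmse(tree_preds, targets, k), 3))
```

Note that this only covers the number of trees. Maximum depth and minimum node size change how each tree is grown, so picking a subset of already-grown trees does not emulate them.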

Is there a way to evaluate model during training?

I am working on a machine learning project. I have set up an ML pipeline for the various stages of the project. The pipeline goes like this:
Data Extraction -> Data Validation -> Preprocessing -> Training -> Model Evaluation
Model evaluation takes place after training is completed, to determine whether a model is approved or rejected.
What I want is for model evaluation to take place during training itself, at any point.
Say, when about 60% of the training is complete, the training is stopped and the model is evaluated; if the model is approved, training resumes.
How can the above scenario be implemented?
No. You should evaluate only at test time. If you try to evaluate during training like this, you can't get a reliable measure of your model's accuracy. When only 60% of the training is done, the model has not been trained on the full dataset; it might give you high accuracy, but the model could be overfitted.
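If a mid-training check is still wanted, one common pattern is to gate on a separate validation split and keep the test set for the final evaluation only. The sketch below is a minimal illustration under that assumption, using a tiny hand-written gradient descent loop on synthetic data; `train_with_checkpoint`, `check_at`, and `max_valid_mse` are hypothetical names invented for this example, not part of any pipeline framework.

```python
import random

def train_with_checkpoint(train, valid, epochs=100, check_at=0.6,
                          max_valid_mse=0.05, lr=0.01):
    """Fit y = w*x + b by stochastic gradient descent with one validation gate.

    After `check_at` of the epochs, the model is scored on `valid`.
    Training resumes only if the validation MSE is within `max_valid_mse`.
    Returns (w, b, approved, epochs_run).
    """
    w, b = 0.0, 0.0
    check_epoch = round(epochs * check_at)
    for epoch in range(1, epochs + 1):
        for x, y in train:
            err = (w * x + b) - y
            w -= lr * err * x
            b -= lr * err
        if epoch == check_epoch:
            valid_mse = sum(((w * x + b) - y) ** 2 for x, y in valid) / len(valid)
            if valid_mse > max_valid_mse:
                return w, b, False, epoch  # rejected: stop training early
    return w, b, True, epochs

rng = random.Random(0)

def make_rows(n):
    """Synthetic rows with y = 2x + 1 plus a little noise."""
    return [(x, 2.0 * x + 1.0 + rng.gauss(0, 0.1))
            for x in (rng.random() for _ in range(n))]

train = make_rows(200)
valid = make_rows(50)

w, b, approved, epochs_run = train_with_checkpoint(train, valid)
print("approved:", approved, "epochs run:", epochs_run)
```

The validation rows are never used for weight updates here, which is what keeps the mid-training score from simply repeating the training accuracy.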

Machine Learning Keras accuracy model vs accuracy new data prediction

I built a deep learning model using keras. The model accuracy is 99%.
$`loss`
[1] 0.03411416
$acc
[1] 0.9952607
When I predict classes on my new data file using the model, only 87% of the classes are correctly classified. My question is: why is there a difference between the model accuracy and the prediction score?
Your 99% is on the training set. It indicates how your algorithm is performing while training; you should never use it as a reference.
You should always look at your test set; this is the value that really matters.
Furthermore, your accuracy curves should generally look like this (at least in shape): the training set accuracy keeps growing, and the test set accuracy follows the same trend but stays below the training curve.
You will almost never have two identical sets (training and testing/validation), so it is normal to see a difference.
The objective of the training set is to let the model learn from your data in a way that generalizes.
The objective of the test set is to see whether it generalized well.
If your test accuracy is too far from your training accuracy, either the two sets differ a lot (mostly in distribution, data types, etc.), or, if they are similar, your model overfits (meaning it is fit too closely to your training data, so even a small difference in the test data leads to wrong predictions).
A model often overfits because it is too complicated, in which case you should simplify it (e.g. reduce the number of layers or the number of neurons).
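The gap between training and test accuracy can be reproduced with a deliberately overfitting model. The sketch below is a toy illustration, not the asker's keras model: it assumes synthetic one-dimensional data with 15% label noise and uses a 1-nearest-neighbour classifier, which simply memorizes the training set.

```python
import random

def nearest_neighbor_predict(train, x):
    """Label of the training point closest to x (a model that memorizes)."""
    return min(train, key=lambda point: abs(point[0] - x))[1]

def accuracy(train, data):
    """Share of rows in `data` whose label the memorizing model gets right."""
    hits = sum(nearest_neighbor_predict(train, x) == label for x, label in data)
    return hits / len(data)

rng = random.Random(0)

def make_point():
    """One synthetic row: the label is 1 when x > 0, flipped 15% of the time."""
    x = rng.uniform(-1, 1)
    label = int(x > 0)
    if rng.random() < 0.15:
        label = 1 - label
    return x, label

train = [make_point() for _ in range(300)]
test = [make_point() for _ in range(300)]

train_acc = accuracy(train, train)  # scored on rows the model has memorized
test_acc = accuracy(train, test)    # scored on rows the model has never seen
print("training accuracy:", train_acc)
print("test accuracy:    ", test_acc)
```

The training score looks perfect because each training row is its own nearest neighbour, noise included. The test score is the one that reflects how the model behaves on new data.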

Feature selection and prediction accuracy in regression Forest in R

I am attempting to solve a regression problem where the input feature set is of size ~54.
Using OLS linear regression with a single predictor 'X1', I am not able to explain the variation in Y - hence I am trying to find additional important features using Regression forest (i.e., Random forest regression). The selected 'X1' is later found to be the most important feature.
My dataset has ~14500 entries. I have separated it into training and test sets in the ratio 9:1.
I have the following questions:
When trying to find the important features, should I run the regression forest on the entire dataset, or only on the training data?
Once the important features are found, should the model be re-built using the top few features to see whether feature selection speeds up the computation at a small cost to predictive power?
For now, I have built the model using the training set and all the features, and I am using it for prediction on the test set. I am calculating the MSE and R-squared from the training set. I am getting high MSE and low R2 on the training data, and reverse on the test data (shown below). Is this unusual?
forest <- randomForest(fmla, dTraining, ntree=501, importance=T)
mean((dTraining$y - predict(forest, data=dTraining))^2)
0.9371891
rSquared(dTraining$y, dTraining$y - predict(forest, data=dTraining))
0.7431078
mean((dTest$y - predict(forest, newdata=dTest))^2)
0.009771256
rSquared(dTest$y, dTest$y - predict(forest, newdata=dTest))
0.9950448
Please advise. Are R-squared and MSE good metrics for this problem, or do I need to look at other metrics to evaluate whether the model is good?
You could also try asking on Cross Validated.
When trying to find the important features, should I run the regression forest on the entire dataset, or only on the training data?
Only on the training data. You want to prevent overfitting, which is why you do a train-test split in the first place.
Once the important features are found, should the model be re-built using the top few features to see whether feature selection speeds up the computation at a small cost to predictive power?
Yes, but the purpose of feature selection is not necessarily to speed up computation. With infinite features, it is possible to fit any pattern of data (i.e., overfitting). With feature selection, you're hoping to prevent overfitting by using only a few 'robust' features.
For now, I have built the model using the training set and all the features, and I am using it for prediction on the test set. I am calculating the MSE and R-squared from the training set. I am getting high MSE and low R2 on the training data, and reverse on the test data (shown below). Is this unusual?
Yes, it's unusual. You want low MSE and high R2 values for both your training and test data. (I would double check your calculations.) If you're getting high MSE and low R2 with your training data, it means your training was poor, which is very surprising. Also, I haven't used rSquared but maybe you want rSquared(dTest$y, predict(forest, newdata=dTest))?
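To double-check the calculations, it can help to compute both metrics by hand from the actual and predicted values. The sketch below (in Python rather than R, with made-up numbers) shows the usual definitions; `mse` and `r_squared` are names chosen for this example. Whether a library helper such as `rSquared` expects predictions or residuals as its second argument depends on its signature, so check its documentation. It is also worth noting that the training calls in the question pass `data=` to `predict` while the test calls pass `newdata=`; if both were meant to be the same kind of call, that difference deserves a look.

```python
def mse(actual, predicted):
    """Mean squared error between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

# Made-up values, only to exercise the two functions.
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]

print("MSE:", mse(actual, predicted))
print("R^2:", r_squared(actual, predicted))
```

Running the same two functions on the training predictions and on the test predictions gives numbers that are directly comparable, which makes a mismatch like the one in the question easier to track down.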

identifying key columns/features used by decision tree regression

In Azure ML, I have a predictive regression model using boosted decision tree regression and it is reasonably accurate.
The input dataset has over 450 columns and the model has done a good job of predicting against test data sets, without over-fitting.
To report on the result, I need to know which features/columns the model mainly used to make predictions, but I can't find this information easily when looking at the trained model data.
How do I identify this information? I'm happy to import the result dataset into R to help find this; I just need pointers on which direction to start working in.
In Microsoft Azure Machine Learning, the features mainly used to make predictions can usually be found in the output of the Train Model module.
But when you use a decision tree algorithm, the output of the Train Model module is the constructed 'trees' of the algorithm.
To find the features that had an impact on predictions when using decision tree algorithms, you can use the Permutation Feature Importance module.
The parameters of Permutation Feature Importance are Random Seed and Metric for Measuring Performance (in this case, Regression - Coefficient of Determination).
The left input of Permutation Feature Importance is your trained model, and the right input is your test data.
The output of Permutation Feature Importance is a list of features and their scores.
You can add an Execute R Script module to extract the features and scores from the Permutation Feature Importance module.
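For intuition about what such a score means, the general idea of permutation importance can be sketched in a few lines. This is not the Azure module's implementation, only a minimal illustration under assumptions: the "trained model" is a hypothetical stand-in function `predict`, the data is synthetic, and the metric is the coefficient of determination, as in the parameters above.

```python
import random

def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

def permutation_importance(predict, rows, targets, seed=0):
    """Drop in R^2 when each column is shuffled, one column at a time."""
    rng = random.Random(seed)
    baseline = r_squared(targets, [predict(r) for r in rows])
    scores = []
    for col in range(len(rows[0])):
        shuffled = [r[col] for r in rows]
        rng.shuffle(shuffled)
        # Rebuild the rows with only this one column permuted.
        permuted = [r[:col] + [v] + r[col + 1:] for r, v in zip(rows, shuffled)]
        scores.append(baseline - r_squared(targets, [predict(r) for r in permuted]))
    return scores

def predict(row):
    """Hypothetical stand-in for a trained model: column 0 matters most,
    column 1 matters a little, column 2 is ignored."""
    return 5.0 * row[0] + 0.5 * row[1]

rng = random.Random(42)
rows = [[rng.random(), rng.random(), rng.random()] for _ in range(500)]
targets = [predict(r) + rng.gauss(0, 0.1) for r in rows]

scores = permutation_importance(predict, rows, targets)
for col, score in enumerate(scores):
    print("column", col, "importance:", round(score, 4))
```

A feature the model relies on heavily produces a large drop in the metric when shuffled, while a feature the model ignores produces no drop at all. Sorting the columns by this score gives the kind of ranking needed for the report.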
