I'm using the R package randomForest, version 4.6-14. The function randomForest takes a parameter localImp, and if that parameter is set to TRUE the function computes local explanations for the predictions. However, these explanations are for the provided training set. I want to fit a random forest model on a training set and use that model to compute local explanations for a separate test set. As far as I can tell, the predict.randomForest function in the same package provides no such functionality. Any ideas?
Can you explain more about what it means to have some local explanation on a test set?
According to this answer and the package documentation, variable importance (including the casewise importance computed when localImp is set) measures how much a variable affects prediction accuracy. For a test set with no labels, prediction accuracy cannot be assessed, so that importance cannot be computed.
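To illustrate why the casewise importance is tied to the training cases, here is a minimal sketch using the built-in iris data (my choice of stand-in data, not something from the question). With localImp = TRUE the fitted object carries a localImportance matrix with one row per predictor and one column per training case, so there is no entry for unseen cases.

```r
library(randomForest)

set.seed(1)  # the forest is random, so fix the seed for repeatable output

# Fit on the training data and request casewise (local) importance
rf <- randomForest(Species ~ ., data = iris, localImp = TRUE)

# One row per predictor, one column per training case
dim(rf$localImportance)

# Casewise importance of each predictor for the first three training cases
rf$localImportance[, 1:3]
```

Each column refers to a case whose label was known at fitting time, which is the reason the measure has no direct counterpart for unlabeled test rows.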
The caret library in R has a hyper-parameter 'selectionFunction' inside trainControl().
It's used to prevent over-fitting models using Breiman's one standard error rule, or tolerance, etc.
Does mlr have an equivalent? If so, which function is it within?
mlr only lets you choose optimal hyperparameters by optimizing certain measures/metrics.
However, essentially each "measure" in mlr is just a function that specifies how a certain performance is handled.
You can try to write your own custom measure as outlined in this vignette.
Other than that, it might be worth opening this as a feature request in the new mlr3 framework, specifically in mlr3measures, since mlr itself is deprecated.
Posting an answer to my own question: I found this in the mlr documentation.
Estimate relative overfitting.
Source: R/relativeOverfitting.R
Estimates the relative overfitting of a model as the ratio of the difference in test and train performance to the difference of test performance in the no-information case and train performance. In the no-information case the features carry no information with respect to the prediction. This is simulated by permuting features and predictions.
estimateRelativeOverfitting(
  predish,
  measures,
  task,
  learner = NULL,
  pred.train = NULL,
  iter = 1
)
Arguments
predish - (ResampleDesc | ResamplePrediction | Prediction) Resampling strategy, resampling prediction, or test predictions.
measures - (Measure | list of Measure) Performance measure(s) to evaluate. Default is the default measure for the task, see getDefaultMeasure.
task - (Task) The task.
learner - (Learner | character(1)) The learner. If you pass a string, the learner will be created via makeLearner.
pred.train - (Prediction) Training predictions. Only needed if test predictions are passed.
iter - (integer) Iteration number. Default 1; usually you don't need to specify this. Only needed if test predictions are passed.
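For what it's worth, a call might look like the sketch below. The iris task, the 2-fold CV and the rpart learner are my own assumptions for illustration; only the function and its argument order come from the help page quoted above.

```r
library(mlr)

set.seed(1)  # resampling splits are random

# Assumed setup: a classification task on iris with 2-fold cross-validation
task <- makeClassifTask(data = iris, target = "Species")
rdesc <- makeResampleDesc("CV", iters = 2)

# predish = resampling strategy, measures = accuracy, plus task and learner
ro <- estimateRelativeOverfitting(rdesc, acc, task, makeLearner("classif.rpart"))

# Relative overfitting estimate per resampling iteration
ro
```

Note that this reports an overfitting estimate after the fact; it is not a selection rule like caret's selectionFunction.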
I am using glmulti to select a set of candidate generalized linear models, and my variable importance values and 'best' model keep changing each time I run the model.
I am struggling to understand why this is. Does glmulti need a set.seed value to make results reproducible?
Thanks.
I believe the glmulti function is in the glmulti package. (You should state this in your question...) If so, the help page says that sometimes it uses a genetic algorithm to find the best model. Those are indeed random algorithms, so you can expect to get an answer that depends on the random number seed.
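As a minimal sketch of what "depends on the random number seed" means in R, the snippet below uses base R's sample as a stand-in for any random procedure. Whether set.seed also controls glmulti's genetic algorithm is an assumption you would need to check: glmulti relies on rJava, so its search may draw from a random number generator that R's seed does not reach.

```r
# Without a fixed seed, repeated draws generally differ between runs.
# With the same seed set before each draw, they are identical.
set.seed(123)
first <- sample(1:100, 5)

set.seed(123)
second <- sample(1:100, 5)

identical(first, second)  # TRUE: same seed, same draws
```

If the results still change after calling set.seed right before glmulti, that would suggest the randomness comes from outside R's generator.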
I am using R along with the neuralnet package (see the docs: https://cran.r-project.org/web/packages/neuralnet/neuralnet.pdf). I have used the neuralnet function to build and train my model.
Now that I have built my model, I want to test it on real data. Could someone explain whether I should use the compute or the prediction function? I have read the documentation and it isn't clear; both functions seem to do something similar.
Thanks
The short answer is to use compute to do predictions.
You can see an example of using compute on the test set here. We can also see that compute is the right one from the documentation:
compute, a method for objects of class nn, typically produced by neuralnet. Computes the outputs
of all neurons for specific arbitrary covariate vectors given a trained neural network.
The above says that you can use covariate vectors in order to compute the output of the neural network i.e. make a prediction.
On the other hand prediction does what is mentioned in the title in the documentation:
Summarizes the output of the neural network, the data and the fitted
values of glm objects (if available)
Moreover, it only takes two arguments: the nn object and a list of glm models so there isn't a way to pass in the test set in order to make a prediction.
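A minimal sketch of compute on held-out inputs, assuming a toy square-root dataset, a single covariate named x, and a raised stepmax to give the training more room to converge (all three are my choices, not from the question):

```r
library(neuralnet)

set.seed(1)  # weights are initialised randomly

# Toy training data (assumed): learn y = sqrt(x)
train <- data.frame(x = runif(50, 0, 100))
train$y <- sqrt(train$x)

nn <- neuralnet(y ~ x, data = train, hidden = 10,
                threshold = 0.01, stepmax = 1e6)

# New covariate values: only the input columns, named as in training
test <- data.frame(x = (1:10)^2)

out <- compute(nn, test)
out$net.result  # one predicted value per test row
```

The returned list also has a neurons element with the outputs of every layer; net.result is the part you want for predictions.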
I use the random forest package in R for regression. It gives me two kinds of information: Mean of squared residuals and % Var explained. But I want to calculate the RMSE and R^2 of the training and test sets. Can anyone help me find this information?
Sorry this is not a specific answer, but I do not have enough cred to leave a comment.
It is tough to say how to get what you want without a reproducible example. However, if you used the xtest= and ytest= arguments in the call to randomForest (assuming you are using the "randomForest" package), then what you are looking for should be part of the resulting randomForest object, in its test component.
An attempted example:
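library(randomForest)
# xtrain, ytrain, xtest, ytest are placeholders for your own data
rf.results <- randomForest(x = xtrain, y = ytrain, xtest = xtest, ytest = ytest)
sqrt(tail(rf.results$test$mse, 1)) # test RMSE; mse holds one value per tree, the last one uses all trees
tail(rf.results$test$rsq, 1) # test pseudo-R2 using all trees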
If you have the random forest package loaded you can validate this information as well as do some exploration yourself with ?randomForest. The "Value" section of the documentation details the object that results from a call to randomForest and where you can find various performance metrics.
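If xtest and ytest were not passed, the same quantities can be computed by hand from predictions. A minimal sketch, assuming mtcars as stand-in data, a random 22/10 split, and two helper functions named rmse and r2 (the helpers are mine, not part of the randomForest package):

```r
library(randomForest)

set.seed(42)

# Assumed stand-in data: split mtcars into training and test rows
idx <- sample(nrow(mtcars), 22)
train <- mtcars[idx, ]
test <- mtcars[-idx, ]

rf <- randomForest(mpg ~ ., data = train, ntree = 200)

# Hypothetical helpers for the two metrics
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))
r2 <- function(obs, pred) 1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)

# Training set, out-of-bag: predict() without newdata returns OOB predictions
rmse_oob <- rmse(train$mpg, predict(rf))
r2_oob <- r2(train$mpg, predict(rf))

# Test set: predictions for rows the forest never saw
pred_test <- predict(rf, newdata = test)
rmse_test <- rmse(test$mpg, pred_test)
r2_test <- r2(test$mpg, pred_test)

c(rmse_oob = rmse_oob, r2_oob = r2_oob, rmse_test = rmse_test, r2_test = r2_test)
```

Using predict(rf, newdata = train) instead would score the training rows with trees that have already seen them, which tends to look far better than the out-of-bag figures.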
I'm using randomForest in order to find out the most significant variables. I was expecting some output that defines the accuracy of the model and also ranks the variables based on their importance. But I am a bit confused now. I tried randomForest and then ran importance() to extract the importance of variables.
But then I saw another command, rfcv (Random Forest Cross-Validation for feature selection), which I suppose should be the most appropriate for this purpose. My questions about it are: how do I get the list of the most important variables? How do I see the output after running it? Which command should I use?
Another thing: What is the difference between randomForest and predict.randomForest?
I am not very familiar with randomForest and R, so any help would be appreciated.
Thank you in advance!
After you have fitted a randomForest model, you use predict.randomForest to apply that model to new data. For example, build a random forest with training data, then run your validation data through that model with predict.randomForest.
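A minimal sketch of that workflow, assuming iris as stand-in data and a random 100/50 split (both my choices). In practice you call the generic predict, which dispatches to predict.randomForest for a randomForest object:

```r
library(randomForest)

set.seed(7)

# Assumed split: 100 training rows, 50 validation rows
idx <- sample(nrow(iris), 100)
train <- iris[idx, ]
valid <- iris[-idx, ]

# randomForest() fits the model ...
rf <- randomForest(Species ~ ., data = train)

# ... and predict() applies the fitted model to new data
pred <- predict(rf, newdata = valid)

# Confusion table of predicted vs. actual classes on the validation rows
table(predicted = pred, actual = valid$Species)
```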
As for rfcv, there is an option recursive which controls (from the help):
whether variable importance is (re-)assessed at each step of variable
reduction
It's all in the help file.
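To make the outputs concrete, here is a minimal sketch on iris (stand-in data, my assumption). rfcv reports the cross-validated error for each number of variables tried, not the names of the variables, so the ranking itself comes from importance() on a fitted forest.

```r
library(randomForest)

set.seed(1)

# Cross-validated error as the number of predictors is reduced
cv <- rfcv(trainx = iris[, -5], trainy = iris$Species, cv.fold = 5)
cv$n.var     # number of variables used at each step
cv$error.cv  # cross-validated error rate for each of those steps

# Ranking of variables: fit a forest with importance = TRUE
rf <- randomForest(Species ~ ., data = iris, importance = TRUE)
imp <- importance(rf, type = 1)  # type 1 = mean decrease in accuracy

# Sort so that the most important variable comes first
ranked <- imp[order(imp[, 1], decreasing = TRUE), , drop = FALSE]
ranked
```

One way to combine the two: use cv$error.cv to decide how many variables to keep, then take that many from the top of the ranked table.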