random forest package in R - r

I use random forest package in R for regression, it gives me two kind of information: Mean of squared residuals and % Var explained. But I wanna calculate the RMSE and R^2 of the training and test sets, can anyone help me how can I find these kind of information?

Sorry this is not a specific answer, but I do not have enough cred to leave a comment.
It is tough to say how you may get at what you want without a reproducible example. However, if you used the xtest= and ytest= arguments in the call to randomForest (assuming you are using the "randomForest" package), then what you are looking for should be a part of the resulting randomForest object. What you want to look in is the test part of the resulting random forest list.
An attempted example:
rf.results <- randomForest( whatever arguments )
rf.results$test$mse # mse (maybe you can take the square root to get rmse)
rf.results$test$rsq # pseudo-R2 for random forest
If you have the random forest package loaded you can validate this information as well as do some exploration yourself with ?randomForest. The "Value" section of the documentation details the object that results from a call to randomForest and where you can find various performance metrics.

Related

Quantile Regression with Time-Series Models (ARIMA-ARCH) in R

I am working on quantile forecasting with time-series data. The model I am using is ARIMA(1,1,2)-ARCH(2) and I am trying to get quantile regression estimates of my data.
So far, I have found "quantreg" package to perform quantile regression, but I have no idea how to put ARIMA-ARCH models as the model formula in function rq.
rq function seems to work for regressions with dependent and independent variables but not for time-series.
Is there some other package that I can put time-series models and do quantile regression in R? Any advice is welcome. Thanks.
I just put an answer on the Data Science forum.
It basically says that most of the ready made packages are using so called exact test based on assumption on the distribution (independent identical normal-Gauss distribution, or wider).
You also have a family of resampling methods in which you simulate a sample with a similar distribution of your observed sample, perform your ARIMA(1,1,2)-ARCH(2) and repeat the process a great number of times. Then you analyze this great number of forecast and measure (as opposed to compute) your confidence intervals.
The resampling methods differs in the way to generate the simulated samples. The most used are:
The Jackknife: in which you "forget" one point, that is you simulate a n samples of size n-1 (if n is the size of the observed sample).
The Bootstrap: in which you simulate a sample by taking n values of the original sample with replacements: some will be taken once, some twice or more, some never,...
It is a (not easy) theorem that the expectation of the confidence intervals, as most of the usual statistical estimators, are the same on the simulated sample than on the original sample. With the difference that you can measure them with a great number of simulations.
Hello and welcome to StackOverflow. Please take some time to read the help page, especially the sections named "What topics can I ask about here?" and "What types of questions should I avoid asking?". And more importantly, please read the Stack Overflow question checklist. You might also want to learn about Minimal, Complete, and Verifiable Examples.
I can try to address your question, although this is hard since you don't provide any code/data. Also, I guess by "put ARIMA-ARCH models" you actually mean that you want to make an integrated series stationary using an ARIMA(1,1,2) plus an ARCH(2) filters.
For an overview of the R time-series capabilities you can refer to the CRAN task list.
You can easily apply these filters in R with an appropriate function.
For instance, you could use the Arima() function from the forecast package, then compute the residuals with residuals() from the stats package. Next, you can use this filtered series as input for the garch() function from the tseries package. Other possibilities are of course possible. Finally, you can apply quantile regression on this filtered series. For instance, you can check out the dynrq() function from the quantreg package, which allows time-series objects in the data argument.

How are the predictions obtained?

I have been unable to find information on how exactly predict.cv.glmnet works.
Specifically, when a prediction is being made are the predictions based on a fit that uses all the available data? Or are predictions based on a fit where some data has been discarded as part of the cross validation procedure when running cv.glmnet?
I would strongly assume the former but was unable to find a sentence in the documentation that clearly states that after a cross validation is finished, the model is fitted with all available data for a new prediction.
If I have overlooked a statement along those lines, I would also appreciate a hint on where to find this.
Thanks!
In the documentation for predict.cv.glmnet :
"This function makes predictions from a cross-validated glmnet model, using the stored "glmnet.fit" object ... "
In the documentation for cv.glmnet (under value):
"glmnet.fit a fitted glmnet object for the full data."

Random Forest Crossvalidation in R

I am working on a random forest in R and I would like to add the 10- folds cross validation to my model. But I am quite stuck there.
This is sample of my code.
install.packages('randomForest')
library(randomForest)
set.seed(123)
fit <- randomForest(as.factor(sickrabbit) ~ Feature1,..., FeatureN ,data=training1, importance=TRUE,sampsize = c(200,300),ntree=500)
I found online the function rfcv in caret but I am not sure to understand how it works. Can anyone help with this function or propose an easier way to implement cross validation. Can you do it using random forest package instead of caret?
You don't need to cross-validate a random forest model. You are getting stuck with the randomForest package because it wasn't designed to do this.
Here is a snippet from Breiman's official documentation:
In random forests, there is no need for cross-validation or a separate test set to get an unbiased estimate of the test set error. It is estimated internally, during the run, as follows:
Each tree is constructed using a different bootstrap sample from the original data. About one-third of the cases are left out of the bootstrap sample and not used in the construction of the kth tree.
Put each case left out in the construction of the kth tree down the kth tree to get a classification. In this way, a test set classification is obtained for each case in about one-third of the trees. At the end of the run, take j to be the class that got most of the votes every time case n was oob. The proportion of times that j is not equal to the true class of n averaged over all cases is the oob error estimate. This has proven to be unbiased in many tests.

R - How to get one "summary" prediction map instead for 5 when using 5-fold cross-validation in maxent model?

I hope I have come to the right forum. I'm an ecologist making species distribution models using the maxent (version 3.3.3, http://www.cs.princeton.edu/~schapire/maxent/) function in R, through the dismo package. I have used the argument "replicates = 5" which tells maxent to do a 5-fold cross-validation. When running maxent from the maxent.jar file directly (the maxent software), an html file with statistics will be made, including the prediction maps. In R, an html file is also made, but the prediction maps have to be extracted afterwards, using the function "predict" in the dismo package in r. When I do this, I get 5 maps, due to the 5-fold cross-validation setting. However, (and this is the problem) I want only one output map, one "summary" prediction map. I assume this is possible, although I don't know how maxent computes it. The maxent tutorial (see link above) says that:
"...you may want to avoid eating up disk space by turning off the “write output grids” option, which will suppress writing of output grids for the replicate runs, so that you only get the summary statistics grids (avg, stderr etc.)."
A list of arguments that can be put into R is found in this forum https://groups.google.com/forum/#!topic/maxent/yRBlvZ1_9rQ.
I have tried to use the argument "outputgrids=FALSE" both in the maxent function itself, and in the predict function, but it doesn't work. I still get 5 maps, even though I don't get any errors in R.
So my question is: How do I get one "summary" prediction map instead of the five prediction maps that results from the cross-validation?
I hope someone can help me with this, I am really stuck and haven't found any answers anywhere on the internet. Not even a discussion about this. Hope my question is clear. This is the R-script that I use:
model1<-maxent(x=predvars, p=presence_points, a=target_group_absence, path="//home//...//model1", args=c("replicates=5", "outputgrids=FALSE"))
model1map<-predict(model1, predvars, filename="//home//...//model1map.tif", outputgrids=FALSE)
Best regards,
Kristin
Sorry to be the bearer of bad news, but based on the source code, it looks like Dismo's predict function does not have the ability to generate a summary map.
Nitty-gritty details for those who care: When you call maxent with replicates set to something greater than 1, the maxent function returns a MaxEntReplicates object, rather than a normal MaxEnt object. When predict receives a MaxEntReplicates object, it just iterates through all of the models that it contains and calls predict on them individually.
So, what next? Fortunately, all is not lost! The reason that Dismo doesn't have this functionality is that for most kinds of model-building, there isn't actually a valid way to average parameters across your cross-validation models. I don't want to go so far as to say that that's definitely the case for MaxEnt specifically, but I suspect it is. As such, cross-validation is usually used more as a way of checking that your model building methodology works for your data than as a way of building your model directly (see this question for further discussion of that point). After verifying via cross-validation that models built using a given procedure seem to be accurate for the phenomenon you're modelling, it's customary to build a final model using all of your data. In theory this new model should only be better than models trained on a subset of your data.
So basically, assuming your cross-validated models look reasonable, you can run MaxEnt again with only one replicate. Your final result will be a model accuracy estimate based on the cross-validation and a map based on the second run with all of your data lumped together. Depending on what exactly your question is, there might be other useful summary statistics from the cross-validation that you want to use, but those are all things you've already seen in the html output.
I may have found this a couple of years later. But you could do something like this:
xm <- maxent(predictors, pres_train) # basically the maxent model
px <- predict(predictors, xm, ext=ext, progress= '' ) #prediction
px2 <- predict(predictors, xm2, ext=ext, progress= '' ) #prediction #02
models <- stack(px,px2) # create a stack of prediction from all the models
final_map <- mean(px,px2) # Take a mean of all the prediction
plot(final_map) #plot the averaged map
xm1,xm2,.. would be the maxent models for each partitions in cross-validation, and px, px2,.. would be the predicted maps.

randomForest in R: Is there a possibility of calculating casewise confidence intervals?

R package randomForest reports mean squared errors for each tree in the forest. I need, however, a measure of confidence for each case in the data. Since randomForest calculates the casewise predictions by averaging the predictions of the single trees, I guess that it should also be possible to calculate a casewise standard error and thus a confidence interval. Can this be done using the output randomForest object (if so: how?) or do I have to dig into the source code?
No need to dig into the source code. You only need to read the documentation. ?predict.randomForest states that one of its arguments is called predict.all:
predict.all Should the predictions of all trees be kept?
So setting that to TRUE will keep a prediction for each case, for each tree, which you can then use to calculate standard error for each case.
I have recently been made aware of this paper by Stefan Wager, Trevor Hastie and Brad Efron which investigates more rigorously the idea of standard errors for the predictions generated by random forests (and other bagged predictors).

Resources