I built a neural network on the Ames housing dataset and took the log of the target variable 'SalePrice' so that it has an approximately normal distribution. My loss function is MAE, and the MAE values of my 'new' NN are significantly lower than those from my 'original' model. Is there any way to translate these loss values back to the original scale? Is it really as simple as just taking the exponential?
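A minimal sketch of why the two scales differ, using simulated numbers rather than the Ames data (the prices and predictions below are hypothetical). The exponential of a log-scale MAE is a typical multiplicative error factor, not an MAE in the original units. To get an MAE in the original units, one option is to back-transform the predictions first and then recompute the error.

```r
set.seed(42)

# Hypothetical sale prices and log-scale predictions (not the Ames data)
price     <- exp(rnorm(200, mean = 12, sd = 0.4))
log_price <- log(price)
pred_log  <- log_price + rnorm(200, sd = 0.1)

# MAE on the log scale, as reported by a model trained on log(SalePrice)
mae_log <- mean(abs(pred_log - log_price))

# exp(MAE_log) is a multiplicative factor: predictions are typically off
# by about this ratio. It is not an error in currency units.
typical_ratio <- exp(mae_log)

# MAE in the original units: back-transform predictions, then take the error
pred_price <- exp(pred_log)
mae_price  <- mean(abs(pred_price - price))

print(c(mae_log = mae_log, typical_ratio = typical_ratio, mae_price = mae_price))
```

Only `mae_price` is directly comparable with the MAE of a model trained on the untransformed target.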
I know that when random forest (RF) is used for classification, the AUC is normally used to assess the quality of the classification after applying it to test data. However, I have no clue which measure to use to assess the quality of regression with RF. I now want to use RF for regression analysis, e.g. using a matrix with several hundred samples and features to predict the (numerical) concentration of chemicals.
The first step is to run randomForest to build the regression model, with y as a continuous numeric variable. How can I know whether the model is good or not, based on the Mean of squared residuals and % Var explained? Sometimes my % Var explained is negative.
Afterwards, if the model is fine and/or applied directly to test data, I get the predicted values. How can I assess whether the predicted values are good or not? I read online that some people calculate the accuracy (formula: 1-abs(predicted-actual)/actual), which also makes sense to me. However, I have many zero values in my actual dataset. Are there any other ways to assess the accuracy of the predicted values?
Looking forward to any suggestions and thanks in advance.
The randomForest R package comes with an importance function, which can be used to see how much each predictor contributes to the accuracy of a model. From the documentation:
importance(x, type=NULL, class=NULL, scale=TRUE, ...), where x is the output from your initial call to randomForest.
There are two types of importance measurements. One uses a permutation of out-of-bag data to test the accuracy of the model. The other uses the Gini index. Again, from the documentation:
Here are the definitions of the variable importance measures. The first measure is computed from permuting OOB data: For each tree, the prediction error on the out-of-bag portion of the data is recorded (error rate for classification, MSE for regression). Then the same is done after permuting each predictor variable. The difference between the two are then averaged over all trees, and normalized by the standard deviation of the differences. If the standard deviation of the differences is equal to 0 for a variable, the division is not done (but the average is almost always equal to 0 in that case).
The second measure is the total decrease in node impurities from splitting on the variable, averaged over all trees. For classification, the node impurity is measured by the Gini index. For regression, it is measured by residual sum of squares.
As one more simple check, really more of a sanity check than anything else, you can compare against what is called the best constant model. The best constant model has a constant output, which is the mean of all responses in the test data set, and it can be regarded as the crudest model possible. Compare the average performance of your random forest model against that of the best constant model on a given set of test data. If the random forest does not outperform the best constant model by at least a factor of, say, 3-5, then your RF model is not very good.
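The comparison against the best constant model can be sketched in base R as follows. The helper name `regression_report` and the example vectors are hypothetical. The error measures used here (MSE, RMSE, MAE) do not divide by the actual value, so they remain defined when many actual values are zero. The `pseudo_r2` value is one minus the ratio of model MSE to constant-model MSE, and it turns negative whenever the model does worse than the constant; as far as I know, the "% Var explained" printed by randomForest is the same kind of quantity computed from out-of-bag predictions.

```r
# Hypothetical helper: compare predictions against the best constant model
regression_report <- function(actual, predicted) {
  err       <- predicted - actual
  mse       <- mean(err^2)
  # Best constant model: always predict the mean of the observed responses
  mse_const <- mean((actual - mean(actual))^2)
  c(mse         = mse,
    rmse        = sqrt(mse),
    mae         = mean(abs(err)),
    mse_const   = mse_const,
    improvement = mse_const / mse,      # factor by which the model beats the constant
    pseudo_r2   = 1 - mse / mse_const)  # negative if worse than the constant
}

# Hypothetical test data with several zeros among the actual values
actual <- c(0, 0, 1.5, 3, 0, 2.5, 4, 0)
good   <- actual + c(0.1, -0.1, 0.2, -0.2, 0.1, -0.1, 0.2, -0.2)
bad    <- rev(actual)

report_good <- regression_report(actual, good)
report_bad  <- regression_report(actual, bad)
print(round(rbind(good = report_good, bad = report_bad), 3))
```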
Due to computational limitations with my GIS software, I am trying to implement random forests in R for image classification purposes. My input is a multi-band TIFF image, which is being trained on an ArcGIS shapefile (target values 0 and 1). The code technically works and produces a valid output. When I view the confusion matrix I get the following:
     0   1  class.error
0   11   3  0.214285714
1    1  13  0.071428571
This is sensible for my data. However when I plot up the output of the image classification in my GIS software (the binary reclassified tiff with values 0 and 1), it predicts the training data with a 100% success rate. In other words there is no classification error with the output image. How is this the case when the confusion matrix indicates there are classification errors?
Am I missing something really obvious here? Code snippet below.
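library(randomForest)
library(raster)  # assumed: the predict() call below matches the raster method
# the attribute table of the training shapefile is in the @data slot
rf.mdl <- randomForest(x = samples@data[, names(PredMaps)], y = samples@data[, ValueFld], ntree = 501, proximity = TRUE, importance = TRUE, keep.forest = TRUE, keep.inbag = TRUE)
ConfMat <- rf.mdl$confusion  # confusion matrix based on out-of-bag predictions
write.csv(ConfMat, file = "ConfMat1.csv")
predict(PredMaps, rf.mdl, filename = classifiedPath, type = "response", na.rm = TRUE, overwrite = TRUE, progress = "text")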
I expected the output classified image to misclassify 1 of the Value=1 training points and misclassify 3 of the Value=0 training points based on what is indicated in the confusion matrix.
The Random Forest algorithm is a bagging method. This means it creates numerous weak classifiers, then has each weak classifier "vote" to create the end prediction. In RF, each weak classifier is one decision tree that is trained on a random sample of observations in the training set. Think of the random samples each decision tree is trained on as a "bag" of data.
What is being shown in the confusion matrix is something called "out-of-bag error" (OOB error). This OOB error is an accurate estimate of how your model would generalize to data it has never seen before (this estimate is usually achieved by testing your model on a withheld testing set). Since each decision tree is trained on only one bag from your training data, the rest of the data (data that's "outside the bag") can stand in for this withheld data.
OOB error is calculated by making a prediction for each observation in the training set. However, when predicting each individual observation, only decision trees whose bags did not include that observation are allowed to participate in the voting process. The result is the confusion matrix available after training a RF model.
When you predict the observations in the training set using the complete model, decision trees whose bags did include each observation are now involved in the voting process. Since these decision trees "remember" the observation they were trained on, they skew the prediction toward the correct answer. This is why you achieve 100% accuracy.
Essentially, you should trust the confusion matrix that uses OOB error. It's a robust estimate of how the model will generalize to unseen data.
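The difference between out-of-bag votes and full-ensemble votes can be reproduced with a toy example in base R. This is only a sketch: it bags 1-nearest-neighbour learners as a stand-in for decision trees (an assumption made to keep the example free of packages), and the data are simulated. Each learner memorizes its bag, so when every learner votes, the training labels are reproduced, while the OOB votes give a more honest error estimate.

```r
set.seed(1)

# Simulated two-class data with noisy labels (hypothetical, not the GIS data)
n <- 100
x <- matrix(rnorm(2 * n), ncol = 2)
y <- as.integer(x[, 1] + rnorm(n) > 0)
D <- as.matrix(dist(x))  # pairwise distances between observations

B <- 501                          # number of bagged learners
votes_all <- matrix(0, n, 2)      # votes from every learner (columns: class 0, class 1)
votes_oob <- matrix(0, n, 2)      # votes only from learners that did not see the point

for (b in seq_len(B)) {
  inbag <- unique(sample(n, n, replace = TRUE))
  # 1-NN learner: predict the label of the nearest in-bag observation
  nn   <- inbag[apply(D[, inbag, drop = FALSE], 1, which.min)]
  pred <- y[nn]
  idx  <- cbind(seq_len(n), pred + 1)
  votes_all[idx] <- votes_all[idx] + 1
  oob     <- setdiff(seq_len(n), inbag)
  idx_oob <- cbind(oob, pred[oob] + 1)
  votes_oob[idx_oob] <- votes_oob[idx_oob] + 1
}

pred_all <- max.col(votes_all, ties.method = "first") - 1
pred_oob <- max.col(votes_oob, ties.method = "first") - 1

train_acc <- mean(pred_all == y)  # full ensemble on its own training data
oob_acc   <- mean(pred_oob == y)  # out-of-bag estimate
print(c(train_acc = train_acc, oob_acc = oob_acc))
```

In randomForest itself, calling predict() on the fitted object without newdata returns the out-of-bag predictions, whereas passing the training data as newdata lets every tree vote.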
I am using R to evaluate two different forecasting models. The basic idea of the evaluation is to compare the Pearson correlation and its corresponding p-value using the function cor.() . The graph below shows the final result of the correlation coefficient and its p-value.
We suggest that the model with a lower correlation coefficient and a correspondingly lower p-value (less than 0.05) is better (or one with a higher correlation coefficient but a fairly high corresponding p-value).
So, in this case, we would say that overall model1 is better than model2.
But the question here is: is there any other specific statistical method to quantify the comparison?
Thanks a lot !!!
I assume you're working with time series data, since you mention a "forecast". I think what you're really looking for is backtesting of your forecast model. Ruey S. Tsay's "An Introduction to Analysis of Financial Data with R" provides a backtest.R function that you might want to take a look at.
backtest(m1,rt,orig,h,xre=NULL,fixed=NULL,inc.mean=TRUE)
# m1: is a time-series model object
# orig: is the starting forecast origin
# rt: the time series
# xre: the independent variables
# h: forecast horizon
# fixed: parameter constraint
# inc.mean: flag for constant term of the model.
Backtesting allows you to see how well your models perform on past data, and Tsay's backtest.R provides RMSE and mean absolute error, which will give you another perspective besides correlation. Caution: depending on the size of your data and the complexity of your model, this can be a very slow-running test.
To compare models you'll normally look at RMSE, which is essentially the standard deviation of the error of your model. The RMSE values of two models evaluated on the same data are directly comparable, and smaller is better.
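The idea of a rolling-origin backtest can be sketched in base R as follows. This is not Tsay's backtest.R: the function `rolling_backtest`, its arguments, and the simulated series are hypothetical, and it simply refits a stats::arima model at every forecast origin and scores the h-step-ahead forecast.

```r
# Hypothetical rolling-origin backtest: refit at each origin, forecast h steps ahead
rolling_backtest <- function(series, orig, h = 1, order = c(1, 0, 0)) {
  n <- length(series)
  origins <- orig:(n - h)
  errs <- vapply(origins, function(origin) {
    fit <- arima(series[1:origin], order = order)
    fc  <- predict(fit, n.ahead = h)$pred[h]
    series[origin + h] - fc           # out-of-sample forecast error
  }, numeric(1))
  c(rmse = sqrt(mean(errs^2)), mae = mean(abs(errs)))
}

set.seed(123)
rt <- as.numeric(arima.sim(list(ar = 0.6), n = 150))  # simulated AR(1) series

# Two candidate models scored on the same origins: AR(1) versus a constant mean
m_ar1  <- rolling_backtest(rt, orig = 120, order = c(1, 0, 0))
m_mean <- rolling_backtest(rt, orig = 120, order = c(0, 0, 0))
print(rbind(ar1 = m_ar1, mean_only = m_mean))
```

Because both models are scored on the same forecast origins, their RMSE and MAE rows can be compared directly.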
An even better alternative is to set up training, testing, and validation sets before you build your models. If you train two models on the same training / test data you can compare them against your validation set (which has never been seen by your models) to get a more accurate measurement of your model's performance measures.
One final alternative, if you have a "cost" associated with an inaccurate forecast, apply those costs to your predictions and add them up. If one model performs poorly on a more expensive segment of data, you may want to avoid using it.
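A small sketch of the cost-weighted comparison, with made-up numbers: the actual values, both sets of predictions, and the per-unit costs are all hypothetical. Model A has the lower plain MAE here, yet it is the more expensive one because its largest miss falls on the costly observation.

```r
# Hypothetical actual values and forecasts from two models
actual <- c(100, 120, 90, 150)
pred_a <- c(105, 118, 80, 149)
pred_b <- c(101, 110, 92, 140)

# Hypothetical cost per unit of absolute error; the third observation is expensive
cost_per_unit <- c(1, 1, 5, 1)

total_cost <- function(actual, predicted, cost) {
  sum(cost * abs(actual - predicted))
}

mae_a  <- mean(abs(actual - pred_a))
mae_b  <- mean(abs(actual - pred_b))
cost_a <- total_cost(actual, pred_a, cost_per_unit)
cost_b <- total_cost(actual, pred_b, cost_per_unit)

print(c(mae_a = mae_a, mae_b = mae_b, cost_a = cost_a, cost_b = cost_b))
```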
As a side note, your interpretation of a p-value as "lower is better" isn't quite right.
P-values address only one question: how likely are your data, assuming a true null hypothesis? They do not measure support for the alternative hypothesis.
I have received AUCs and predictions from a collaborator, generated in Weka. The statistical model behind them was cross-validated, so my dataset with the predictions includes columns for fold, predicted probability and true class. Using this data, I was unable to replicate the given AUCs from the predicted probabilities in R. The values always differ slightly.
Additional details:
Weka was used via GUI, not command line
I checked the AUC in R with packages pROC and ROCR
I first tried calculating the AUC over the collected predictions (without regard to fold) and I got different AUCs
Then I tried calculating the AUCs per fold and averaging. This did also not match.
The model was ridge logistic regression and there is a single tie in the predictions
The first fold has one sample more than the others. I have tried taking a weighted average, but this did not work out either
I have even tested averaging the AUCs after logit-transformation (for normality)
Taking the median instead of the mean did not help either
I am familiar with how the AUC is calculated in R, but I don't see what Weka could do differently.
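For reference, the two calculations described above (pooled over all predictions versus averaged per fold) can be written in base R as below. This is a sketch with made-up predictions, and it makes no claim about what Weka does internally. The helper `auc_rank` is hypothetical; it uses the rank-based (Mann-Whitney) form of the AUC, in which tied scores count as half a concordant pair.

```r
# Hypothetical helper: rank-based AUC, ties between scores count as 1/2
auc_rank <- function(score, label) {
  r     <- rank(score)                 # average ranks for tied scores
  n_pos <- sum(label == 1)
  n_neg <- sum(label == 0)
  (sum(r[label == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Made-up cross-validated predictions: fold, predicted probability, true class
cv <- data.frame(
  fold  = c(1, 1, 1, 1, 2, 2, 2, 2),
  prob  = c(0.1, 0.4, 0.35, 0.5, 0.6, 0.7, 0.65, 0.9),
  truth = c(0, 0, 1, 1, 0, 0, 1, 1)
)

# AUC over the collected predictions, ignoring folds
auc_pooled <- auc_rank(cv$prob, cv$truth)

# AUC per fold, then averaged
auc_by_fold   <- sapply(split(cv, cv$fold), function(d) auc_rank(d$prob, d$truth))
auc_fold_mean <- mean(auc_by_fold)

print(c(pooled = auc_pooled, fold_mean = auc_fold_mean))
```

In this toy data the scores in fold 2 sit higher than in fold 1, which is enough to make the pooled AUC differ from the fold average.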
I have used the following R packages: mice, mitools, and pROC.
Basic design: 3 predictor measures with missing data rates between 5% and 70% on n~1,000. 1 binary target outcome variable.
Analytic Goal: Determine the AUROC of each of the 3 predictors.
I used the mice package to impute data and now have m datasets of imputed data.
Using the following command, I am able to get AUROC curves for each of m datasets:
fit1<-with(imp2, (roc(target, symptom1, ci=TRUE)))
fit2<-with(imp2, (roc(target, symptom2, ci=TRUE)))
fit3<-with(imp2, (roc(target, symptom3, ci=TRUE)))
I can see the estimates for each of m datasets without any problems.
fit1
fit2
fit3
To combine the parameters, I attempted to use mitools
>summary(pool(fit1))
>summary(pool(fit2))
>summary(pool(fit3))
I get the following error message:
"Error in pool(fit): Object has no vcov() method".
When combining coefficient estimates from m datasets, my understanding is that this is a simple average of the coefficients. However, the error term is more complex.
My question: How do I pool the "m" ROC parameter estimates (AUROC and 95% C.I. or S.E.) to get an accurate estimate of the error term for significance testing/95% Confidence Intervals?
Thank you for any help in advance.
I think the following works to combine the estimates.
pROC produces a point estimate for the AUROC as well as a 95% Confidence Interval.
To combine the AUROC from the m imputed datasets, I simply averaged the AUROCs.
To create an appropriate standard error estimate and then a 95% C.I., I converted the 95% C.I.s into S.E.s. Using the standard formulas (Multiple Imputation FAQ), I computed the within, between, and total variance for the estimate. Once I had the standard error, I converted that back to a 95% C.I.
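The steps above can be sketched in base R roughly as follows. The helper `pool_auc` and the example numbers are hypothetical. It assumes that each confidence interval is a symmetric normal-approximation interval, so that the standard error can be recovered from its width, and it uses a normal reference distribution for the pooled interval (Rubin's rules in full use a t distribution with adjusted degrees of freedom).

```r
# Hypothetical helper: pool m AUROC estimates and their CIs with Rubin's rules
pool_auc <- function(auc, lower, upper, level = 0.95) {
  m  <- length(auc)
  z  <- qnorm(1 - (1 - level) / 2)
  se <- (upper - lower) / (2 * z)     # assumes symmetric normal-approximation CIs

  q_bar   <- mean(auc)                # pooled point estimate
  within  <- mean(se^2)               # within-imputation variance
  between <- var(auc)                 # between-imputation variance
  total   <- within + (1 + 1 / m) * between

  se_pooled <- sqrt(total)
  c(estimate = q_bar,
    se       = se_pooled,
    lower    = q_bar - z * se_pooled,
    upper    = q_bar + z * se_pooled)
}

# Made-up results from m = 3 imputed datasets, each with S.E. 0.02
auc_m <- c(0.70, 0.72, 0.74)
z95   <- qnorm(0.975)
pooled <- pool_auc(auc_m, lower = auc_m - z95 * 0.02, upper = auc_m + z95 * 0.02)
print(round(pooled, 4))
```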
If anyone has any better suggestions, I would very much appreciate it.
I would use bootstrapping with the boot package to assess the different sources of variance. For instance for the variance due to imputation, you could use something like this:
library(boot)
library(pROC)
bootstrap.imputation <- function(d, i, symptom) {
  sampled.data <- d[i, ]
  imputed.data <- ... # here the code you use to generate one imputed dataset, but apply it to sampled.data
  as.numeric(auc(roc(imputed.data$target, imputed.data[[symptom]])))
}
boot.n <- 2000
# symptom is passed by name so that it reaches bootstrap.imputation through ...
# (the fourth positional argument of boot() is sim, not ...)
boot(dataset, bootstrap.imputation, boot.n, symptom = "symptom1")
boot(dataset, bootstrap.imputation, boot.n, symptom = "symptom2")
boot(dataset, bootstrap.imputation, boot.n, symptom = "symptom3")
You can then do the same to assess the variance of the AUC: impute your data and apply the bootstrap again (or use the built-in functions of pROC).