I need to compute the confidence interval and p-value for a comparison of AUCs, but I have a nested cross-validation, so I have 10 models instead of one and my final result is an average over them. As I do not have much data, I do not have an additional test set. But I have been asked to compare the AUCs using DeLong's test and to obtain confidence intervals for my final result (the average over the nested cross-validation).
I wonder if I can concatenate the scores of the 10 models from the nested cross-validation and use them as the input to the DeLong test in R.
I have also seen in the manual that it accepts the original features (rather than predicted probabilities). How can that be? Does it fit a logistic regression internally to obtain them?
library(Daim)
M2 <- deLong.test(iris[,1:4], labels=iris[,5], labpos="versicolor")
Could I use the same idea?
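To make the idea concrete, here is a minimal sketch of what I mean by concatenating the out-of-fold scores, written with pROC rather than Daim (the vectors y, p1 and p2 are hypothetical placeholders for the pooled true labels and the two models' pooled out-of-fold probabilities):
library(pROC)
roc1 <- roc(y, p1)  # pooled out-of-fold scores of model 1
roc2 <- roc(y, p2)  # pooled out-of-fold scores of model 2
ci.auc(roc1, method = "delong")          # DeLong CI for the pooled AUC of model 1
ci.auc(roc2, method = "delong")          # DeLong CI for the pooled AUC of model 2
roc.test(roc1, roc2, method = "delong")  # paired DeLong test of the AUC difference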
I'm using quantile normalization on a prediction problem (which has a train data set and a test data set).
(To understand quantile normalization : https://davetang.org/muse/2014/07/07/quantile-normalisation-in-r/)
I use the preprocessCore R package for quantile normalization.
I'm currently performing quantile normalization separately on train data and test data.
Is it possible to somehow obtain normalization parameters from the train data and then apply those parameters to the test data?
My understanding is that this is not possible, since quantile normalization is done across samples and does not really have any parameters that can be estimated and reused on test data.
This is in contrast to something like z-score normalization, i.e. (feature value - mean) / std_dev, where the mean and std_dev of each feature can be estimated from the train data and applied to the test data.
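For example, this is the fit-on-train / apply-to-test pattern I have in mind for z-scores (train and test are hypothetical numeric matrices with samples in rows and features in columns):
mu  <- colMeans(train)        # per-feature means estimated on the train data
sdv <- apply(train, 2, sd)    # per-feature standard deviations from the train data
train_z <- scale(train, center = mu, scale = sdv)
test_z  <- scale(test,  center = mu, scale = sdv)   # reuse the train parameters on the test data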
Can someone please let me know whether this is possible with quantile normalization?
Any references which would help me understand this better would be really useful.
The frequency of the data is monthly and the series is stationary. I have an ARIMA model built with auto.arima; a couple of checks, such as ACF plots and the ADF test, were applied to the data before fitting the model.
y is my monthly time series, created with the ts() function:
myarima <- auto.arima(y, stepwise = FALSE, approximation = FALSE, trace = TRUE)
Then I use the forecast() function:
forecast <- forecast(myarima, h = 10)
autoplot(forecast)
In this case I did not create train and test sets, because my data fluctuate at the end: if I held out the last 10 months as a test set (to match the forecast horizon), the model would never see those final fluctuations during fitting. It would be great to be enlightened about k-fold cross-validation as a way to avoid this kind of scenario.
Without a train/test split, after creating the model and visualizing the forecast, I went for tsCV():
myforecast_arima <- function(x, h) {
  forecast(auto.arima(x, stepwise = FALSE, approximation = FALSE), h = h)
}
error_myarima <- tsCV(y, myforecast_arima, h = 10)
mean(error_myarima^2, na.rm = TRUE)  # to get the MSE
I then get a fairly low MSE value, around 0.30.
So my questions are: is this a trustworthy way to evaluate ARIMA models before deploying them, or should I use a train/test split instead? What would you prefer in general? Should I use any other method? How can I determine the window parameter of the tsCV() function? If my approach is correct, how can I improve it? And what are the biggest differences between k-fold CV and tsCV()?
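To make the window question concrete, I assume a rolling-window call would look roughly like this (the window length of 60 months is only a placeholder, not a recommendation):
error_rolling <- tsCV(y, myforecast_arima, h = 10, window = 60)  # rolling 60-month estimation window
mean(error_rolling^2, na.rm = TRUE)                              # MSE of the rolling-window evaluation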
Thank you!
I am using R to evaluate two different forecasting models. The basic idea of the evaluation is to compare the Pearson correlation and its corresponding p-value, obtained with cor.test(). The graph below shows the resulting correlation coefficients and p-values.
We suggest that the model with the lower correlation coefficient and a correspondingly low p-value (less than 0.05) is better (or, alternatively, one with a higher correlation coefficient but a fairly high p-value).
So, in this case, overall, we would say that model 1 is better than model 2.
But the question is: is there any other specific statistical method to quantify this comparison?
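For reference, the comparison boils down to something like this (obs, pred_model1 and pred_model2 are hypothetical placeholders for the observed series and the two models' forecasts):
ct1 <- cor.test(obs, pred_model1)   # Pearson correlation and p-value for model 1
ct2 <- cor.test(obs, pred_model2)   # Pearson correlation and p-value for model 2
c(r_model1 = unname(ct1$estimate), p_model1 = ct1$p.value,
  r_model2 = unname(ct2$estimate), p_model2 = ct2$p.value)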
Thanks a lot !!!
I'm assuming you're working with time-series data since you mention a "forecast". I think what you're really looking for is backtesting of your forecast model. From Ruey S. Tsay's "An Introduction to Analysis of Financial Data with R", you might want to take a look at his backtest.R function.
backtest(m1,rt,orig,h,xre=NULL,fixed=NULL,inc.mean=TRUE)
# m1: is a time-series model object
# orig: is the starting forecast origin
# rt: the time series
# xre: the independent variables
# h: forecast horizon
# fixed: parameter constraint
# inc.mean: flag for constant term of the model.
Backtesting lets you see how well your models perform on past data, and Tsay's backtest.R reports RMSE and mean absolute error, which gives you another perspective beyond correlation. A word of caution: depending on the size of your data and the complexity of your model, this can be a very slow-running test.
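A sketch of how a call might look, assuming you have downloaded and sourced Tsay's script; rt is your series, and the ARIMA order and forecast origin below are arbitrary placeholders:
source("backtest.R")                          # wherever you saved Tsay's script
m1 <- arima(rt, order = c(1, 0, 1))           # placeholder model, for illustration only
backtest(m1, rt, orig = floor(0.8 * length(rt)), h = 1)  # fit on roughly the first 80%, then roll the origin forward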
To compare models you'll normally look at RMSE, which is essentially the standard deviation of your model's errors. The two models' RMSEs are directly comparable, and smaller is better.
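If you want to compute these yourself, they are one-liners (actual and predicted are hypothetical vectors of observed values and forecasts):
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))  # root mean squared error
mae  <- function(actual, predicted) mean(abs(actual - predicted))       # mean absolute error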
An even better alternative is to set up training, testing, and validation sets before you build your models. If you train two models on the same training / test data you can compare them against your validation set (which has never been seen by your models) to get a more accurate measurement of your model's performance measures.
One final alternative, if you have a "cost" associated with an inaccurate forecast, apply those costs to your predictions and add them up. If one model performs poorly on a more expensive segment of data, you may want to avoid using it.
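For instance, with a hypothetical per-observation cost vector aligned with the forecasts, the comparison could be as simple as:
cost_model1 <- sum(cost * abs(actual - pred_model1))  # total cost of model 1's errors
cost_model2 <- sum(cost * abs(actual - pred_model2))  # total cost of model 2's errors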
As a side note, your interpretation of the p-value as "lower is better" isn't quite right.
P-values address only one question: how likely are your data, assuming the null hypothesis is true? They do not measure support for the alternative hypothesis.
I have two variables that I'd like to analyze with a 2x2 table, which is easy enough.
datatable <- table(data$Q1data1, data$Q1data2)
summary(datatable)
However, I need to weight each variable separately using two frequency weighting variables that I have. So far, I've only found the wtd.chi.sq function in the weights package, which only allows you to weight both variables by the same weighting variable.
In addition, I need to perform this 2x2 chi-square test 1000 times using bootstrapping or some other resampling method, so that I can eventually look at the distribution of p-values.
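As a starting point, the resampling part might look like the sketch below; it bootstraps rows of the data jointly and collects the chi-square p-value each time, and it deliberately leaves out the two separate frequency weights, which is exactly the part I don't know how to handle:
set.seed(1)
pvals <- replicate(1000, {
  idx <- sample(nrow(data), replace = TRUE)            # bootstrap rows with replacement
  tab <- table(data$Q1data1[idx], data$Q1data2[idx])   # 2x2 table on the resample
  suppressWarnings(chisq.test(tab)$p.value)            # chi-square p-value for this resample
})
hist(pvals)  # distribution of p-values across resamples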
I have a dataset of 506 rows on which I am performing leave-one-out cross-validation. Once I get the mean squared errors, I compute the mean of those mean squared errors, and this value changes every time I run it. Is this expected? If so, can someone please explain why it changes from run to run?
To do leave-one-out CV, I shuffle the rows first (df is the data frame):
df <-df[sample.int(nrow(df)),]
Then I split the data frame into 506 one-row data frames, hold each one out in turn as the test set, fit lm() on the rest, and get the MSE for each held-out row:
fit <- lm(train[,lastcolumn] ~.,data = train)
pred <- predict(fit,test)
pred <- mean((pred - test[,lastcolumn])^2)
And then I take the mean of all the MSEs I got.
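Put together, the whole procedure looks roughly like this (a condensed sketch; here the formula is built from the response column's name so that "." covers only the predictors):
response <- names(df)[lastcolumn]                 # name of the response column
fml      <- as.formula(paste(response, "~ ."))    # response ~ all other columns
mse <- numeric(nrow(df))
for (i in seq_len(nrow(df))) {
  train  <- df[-i, ]                              # all rows except the held-out one
  test   <- df[ i, , drop = FALSE]                # the single held-out row
  fit    <- lm(fml, data = train)
  mse[i] <- (predict(fit, test) - test[[response]])^2
}
mean(mse)  # mean of the per-row squared errors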
Every time I run all of this, I get a different mean. Is this expected?
Leave-one-out cross-validation is a validation paradigm. You have to state what algorithm you are using for your predictions and check whether there is any random initialization of parameters in that algorithm. If that initialization changes randomly, it could explain getting a different result every time the underlying algorithm is run. For example, a Gaussian mixture model used for classification, with different initializations of the means and covariances, is an algorithm whose performance is not necessarily the same across LOOCV runs. Gaussian mixture models and k-means typically randomize which data points are chosen as initial means, and even the number of Gaussians in the mixture could change between initializations if an information-theoretic criterion is used to estimate it.
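As a small illustration of the general point (not your lm() setting, just random initialization at work), fitting the same k-means model twice with different seeds can give different results:
set.seed(1)
k1 <- kmeans(iris[, 1:4], centers = 3)  # first random start
set.seed(2)
k2 <- kmeans(iris[, 1:4], centers = 3)  # different random start
k1$tot.withinss
k2$tot.withinss  # can differ, purely because of the random starting centers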