quantile normalization on test data using parameters from train data - r

I'm using quantile normalization on a prediction problem (which has a train data set and a test data set).
(To understand quantile normalization: https://davetang.org/muse/2014/07/07/quantile-normalisation-in-r/)
I use the preprocessCore R package for quantile normalization.
I'm currently performing quantile normalization separately on the train data and the test data.
Is it possible to somehow obtain normalization parameters from the train data and then apply those parameters to the test data?
My understanding is that this is not possible, since quantile normalization is done across samples and does not really have any parameters that can be estimated and then reused on test data.
This is in contrast to something like z-score normalization, i.e. (feature value - mean) / std_dev, where the mean and std_dev of each feature can be estimated from the train data and applied to the test data.
Can someone please let me know whether this is possible with quantile normalization?
Any references which would help me understand this better would be really useful.
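For what it's worth, one way to think about it is that the "parameters" of quantile normalization are the reference quantiles themselves, i.e. the mean of the sorted train columns. Below is a minimal base-R sketch (made-up data) of storing that reference from the train data and applying it to test samples; it assumes columns are samples and rows are features, as in the linked tutorial, and it ignores ties and unequal feature counts. preprocessCore also appears to provide normalize.quantiles.determine.target() and normalize.quantiles.use.target() for exactly this train/apply split, but verify the names and signatures in the package documentation.
set.seed(1)
train <- matrix(rnorm(20), nrow = 5)   # 5 features x 4 train samples (toy data)
test  <- matrix(rnorm(10), nrow = 5)   # 5 features x 2 test samples (toy data)
# "Parameters" learned from the train data: the reference distribution,
# i.e. the mean of each rank across the sorted train columns.
ref <- rowMeans(apply(train, 2, sort))
# Apply to a new sample: replace each value by the reference value of the
# same rank (ties / different feature counts would need interpolation).
qn_apply <- function(x, ref) ref[rank(x, ties.method = "first")]
test_qn  <- apply(test, 2, qn_apply, ref = ref)
test_qn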

Related

How to transform data after fitting a distribution with gamlss?

I have a data set where observations come from highly distinct groups. Each group may have a wildly different distribution, so I am trying to find the best distribution using fitdist from fitdistrplus and then using gamlssML from the gamlss package to estimate the best parameters.
My issue is with transforming the data after this step. For some of the distributions, like the Box-Cox t, I can find the equation for normalizing the data using the BCT coefficients, but for many of these distributions I cannot.
Does gamlss have a function that normalizes the data after fitting? Their documentation only provides the transformations for a small number of distributions: https://www.gamlss.com/wp-content/uploads/2018/01/DistributionsForModellingLocationScaleandShape.pdf
Thanks a lot
The normalised data values (for any distribution) are exactly equal to the residuals from a gamlss fit,
m1 <- gamlss()
which can be accessed by
residuals(m1) or
m1$residuals
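A minimal sketch of that, with made-up data, assuming an intercept-only gamlss() fit of the Box-Cox t (BCT) family to a positive variable y; by default, residuals() on a gamlss object returns the normalised quantile residuals:
library(gamlss)
set.seed(1)
y  <- rgamma(200, shape = 2, scale = 3)   # illustrative positive data
m1 <- gamlss(y ~ 1, family = BCT)         # intercept-only Box-Cox t fit
z  <- residuals(m1)                       # normalised (quantile) residuals
qqnorm(z); qqline(z)                      # should look roughly N(0, 1)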

How to compute DeLong in R with nested cross-validation

I need to compute a confidence interval and p-value for a comparison of AUCs, but I am using nested cross-validation, so I have 10 models instead of one and my final result is their average. As I do not have much data, I do not have an additional test set. Nevertheless, I have been asked to compare the AUCs using the DeLong test and to obtain confidence intervals for my final result (the average over the nested cross-validation).
I wonder whether I can concatenate the scores of the 10 models from the nested cross-validation and use them as the input to the DeLong test in R.
I have also seen in the manual that the original features (rather than predicted probabilities) are used as input. How can that be? Does it fit a logistic regression internally to obtain the probabilities?
library(Daim)
M2 <- deLong.test(iris[,1:4], labels=iris[,5], labpos="versicolor")
Could I use the same idea?
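Not an answer to whether pooling is statistically valid, but here is a minimal sketch of the mechanics described above, using pROC instead of Daim (a swap on my part, purely for illustration); y, p1 and p2 are made-up stand-ins for the pooled out-of-fold labels and the two models' pooled out-of-fold scores:
library(pROC)
set.seed(1)
y  <- rbinom(300, 1, 0.5)        # pooled out-of-fold labels (made up)
p1 <- plogis(2 * y + rnorm(300)) # pooled out-of-fold scores, model 1 (made up)
p2 <- plogis(1 * y + rnorm(300)) # pooled out-of-fold scores, model 2 (made up)
roc1 <- roc(y, p1)
roc2 <- roc(y, p2)
roc.test(roc1, roc2, method = "delong", paired = TRUE)  # DeLong AUC comparison
ci.auc(roc1, method = "delong")                         # DeLong CI for AUC of model 1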

Evaluate forecasts in terms of p-value and Pearson correlation

I am using R to do some evaluation of two different forecasting models. The basic idea of the evaluation is to compare the Pearson correlation and its corresponding p-value, computed with cor.test(). The graph below shows the final correlation coefficients and their p-values.
We suggest that the model with a lower correlation coefficient and a correspondingly lower p-value (less than 0.05) is better (or, a higher correlation coefficient but with a fairly high corresponding p-value).
So, in this case, overall, we would say that model 1 is better than model 2.
But the question here is: is there any other specific statistical method to quantify the comparison?
Thanks a lot!
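For concreteness, a minimal sketch of the comparison described above, with made-up vectors, where obs stands for the observed series and fc1/fc2 for the two models' forecasts:
set.seed(1)
obs <- rnorm(100)        # observed values (made up)
fc1 <- obs + rnorm(100)  # stand-in for model 1 forecasts
fc2 <- rnorm(100)        # stand-in for model 2 forecasts
cor.test(obs, fc1)       # Pearson r and its p-value, model 1
cor.test(obs, fc2)       # Pearson r and its p-value, model 2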
I'm assuming you're working with time series data, since you mention a "forecast". I think what you're really looking for is backtesting of your forecast models. You might want to take a look at the backtest.R function from Ruey S. Tsay's "An Introduction to Analysis of Financial Data with R".
backtest(m1,rt,orig,h,xre=NULL,fixed=NULL,inc.mean=TRUE)
# m1: a fitted time-series model object
# rt: the time series
# orig: the starting forecast origin
# h: the forecast horizon
# xre: the independent variables
# fixed: parameter constraints
# inc.mean: flag for the constant term of the model
Backtesting lets you see how well your models perform on past data, and Tsay's backtest.R reports RMSE and mean absolute error (MAE), which gives you another perspective beyond correlation. Caution: depending on the size of your data and the complexity of your model, this can be a very slow-running test.
To compare models you'll normally look at RMSE, which is essentially the standard deviation of your model's errors. The two models' RMSEs are directly comparable, and smaller is better.
An even better alternative is to set up training, test, and validation sets before you build your models. If you train the two models on the same training / test data, you can compare them against your validation set (which has never been seen by either model) to get a more accurate measurement of their performance.
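As a rough sketch of that kind of held-out comparison (actual, fc1 and fc2 below are made-up placeholders for the validation series and the two models' forecasts):
rmse <- function(actual, pred) sqrt(mean((actual - pred)^2))
mae  <- function(actual, pred) mean(abs(actual - pred))
set.seed(1)
actual <- cumsum(rnorm(50))             # made-up held-out series
fc1    <- actual + rnorm(50, sd = 0.5)  # stand-in for model 1 forecasts
fc2    <- actual + rnorm(50, sd = 1.0)  # stand-in for model 2 forecasts
c(model1 = rmse(actual, fc1), model2 = rmse(actual, fc2))  # smaller is better
c(model1 = mae(actual, fc1),  model2 = mae(actual, fc2))   # smaller is better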
One final alternative, if you have a "cost" associated with an inaccurate forecast, apply those costs to your predictions and add them up. If one model performs poorly on a more expensive segment of data, you may want to avoid using it.
As a side-note, your interpretation of the p-value as "lower is better" isn't quite right.
P-values address only one question: how likely are your data, assuming the null hypothesis is true? They do not measure support for the alternative hypothesis.

R: Limit/Set values of predicted results from linear model

New to R.
Looking to limit the range of values that can be predicted.
df.Train <- data.frame(S=c(1,2,2,2,1),L=c(1,2,3,3,1),M=c(400,450,400,700,795),V=c(423,400,555,600,800),G=c(4,3.2,2,2.7,3.4), stringsAsFactors=FALSE)
m.Train <- lm(G~S+L+M+V,data=df.Train)
df.Test <- data.frame(S=c(1,2,1,2,1),L=c(1,2,3,1,1),M=c(400,450,500,800,795),V=c(423,475,555,600,555), stringsAsFactors=FALSE)
round(predict(m.Train, df.Test, type="response"),digits=1)
#seq(0,4,.1) #Predicted values should fall in this range
I've experimented with the predict() options but no luck.
Is there an option in predict? Should I be limiting it in the model?
Thank you
There are ways to transform your response variable (G in this case), but there needs to be a good reason to do so. For example, if you want the output to be probabilities between 0 and 1 and your response variable is binary (0/1), then you need a logistic regression.
It all comes down to what data you have and whether a model / transformation of the response variable would be appropriate. In your example you do not specify what the data is and therefore we cannot say anything about which model or which transformation to use.
Setting all that aside, if you really care about the predictions and do not care about the model or the transformation (but why wouldn't you care?), it looks like your data could use a quasi-Poisson generalised linear model, which might provide the output you need:
df.Train <- data.frame(S=c(1,2,2,2,1),L=c(1,2,3,3,1),M=c(400,450,400,700,795),V=c(423,400,555,600,800),G=c(4,3.2,2,2.7,3.4), stringsAsFactors=FALSE)
m.Train <- glm(G~S+L+M+V,data=df.Train, family=quasipoisson)
df.Test <- data.frame(S=c(1,2,1,2,1),L=c(1,2,3,1,1),M=c(400,450,500,800,795),V=c(423,475,555,600,555), stringsAsFactors=FALSE)
> predict(m.Train, df.Test, type="response")
1 2 3 4 5
4.000000 2.840834 3.062754 3.615447 4.573276
#probably not as good as you want
The model uses a log link by default, which ensures the predicted values are positive. There is no guarantee that the model will not predict values greater than 4, but since you fed it values no greater than 4 (your G variable), chances are that most of the predictions will stay in that range (as in this example). You might then need to consider how to treat predictions that go above 4.
In general you should consider carefully which model and which response transformation to choose. The Poisson model above, for example, is usually used for count data. However, you should never manipulate the predictions on your own, so if you choose the lm model in the end, make sure you use the predictions it gives.
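For completeness, here is a rough sketch of the response-transformation route mentioned at the top of this answer (purely illustrative, not a recommendation): squeeze G from [0, 4] onto (0, 1), fit on the logit scale, and back-transform the predictions so they stay within the range. The small squeeze is needed because G = 4 sits exactly on the boundary.
n     <- nrow(df.Train)
g01   <- (df.Train$G / 4 * (n - 1) + 0.5) / n  # squeeze [0, 4] into (0, 1)
m.Log <- lm(qlogis(g01) ~ S + L + M + V, data = df.Train)
pred  <- 4 * plogis(predict(m.Log, df.Test))   # back-transform to the (0, 4) scale
round(pred, digits = 1)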
EDIT
It looks like in your case a non-linear regression might be what you need. The problem with a linear model like lm is that predictions can be greater than the maximum of the observed cases and less than the minimum. In that case a linear regression might not be appropriate. There are algorithms that will never predict a value greater than the max or less than the min, and such an algorithm might be better suited to your case. One of them is k-nearest neighbours, for example:
library(FNN)
> knn.reg(df.Train[1:4], test=df.Test[1:4], y=df.Train[5], k=3)
Prediction:
[1] 3.066667 3.066667 3.066667 2.700000 3.100000
As you can see, the predictions never go above 4, since they are averages of observed values. That said, knn is a local algorithm, so again you need to research whether it is a good approach for your problem and your data. In terms of the predictions, though, it definitely satisfies your constraint. Knn is a very easy-to-understand algorithm that relies on distances between points to compute its predictions.
Hope it helps :)

Leave one out cross validation with lm function in R

I have a dataset of 506 rows on which I am performing leave-one-out cross-validation. Once I have the mean squared errors, I compute the mean of all the mean squared errors I found. This mean changes every time I run it. Is this expected? If so, can someone please explain why it changes every time I run it?
To do leave-one-out CV, I shuffle the rows first (df is the data frame):
df <- df[sample.int(nrow(df)), ]
Then I split the data frame into 506 train/test pairs, send each training set to lm(), and get the MSE for each test data frame (in this case, a single row):
fit <- lm(train[,lastcolumn] ~.,data = train)
pred <- predict(fit,test)
pred <- mean((pred - test[,lastcolumn])^2)
And then I take the mean of all the MSEs I got.
Every time I run all of this, I get a different mean. Is this expected?
Leave-one-out cross-validation is a validation paradigm. You have to state what algorithm you are using for your predictions, and you have to check whether there is some random initialization of the parameters in that prediction algorithm. If the initialization changes randomly, that could explain a different result every time the underlying algorithm is run. So you have to mention which estimator / prediction algorithm you are using.
If, for example, you use a Gaussian mixture model for classification with different initializations for the means and covariances, that would be an algorithm whose LOOCV performance is not necessarily the same from run to run. Gaussian mixture models and k-means algorithms typically randomize the selection of the data points used to initialize the means. The number of Gaussians in the mixture could also change across initializations if an information-theoretic criterion is used to estimate it.
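For reference, a minimal sketch of the leave-one-out loop the question describes, assuming the response is the last column of df (the names here are placeholders):
resp <- names(df)[ncol(df)]                            # response column name
form <- reformulate(setdiff(names(df), resp), response = resp)
mses <- numeric(nrow(df))
for (i in seq_len(nrow(df))) {
  fit     <- lm(form, data = df[-i, ])                 # fit on all rows but i
  pred    <- predict(fit, newdata = df[i, , drop = FALSE])
  mses[i] <- (pred - df[i, resp])^2                    # squared error for row i
}
mean(mses)  # note: lm has no random initialization, so this is deterministic for a fixed df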
