How to get chi-square p-value from gofstat in R

I'm trying to compare sample data against a list of candidate distributions (thanks to help on StackOverflow), but have hit a roadblock. gofstat seems to be working splendidly, and the graphical output is exactly what I want. However, the final goal of this piece of code is to find the best-fitting distribution for the sample data (which will eventually be read from a text file, and will not be ideal at all), along with the parameters of that distribution.
The first step is to find the chi-square test p-value for each fitted distribution, and then take the largest of these p-values, which should indicate the best-fitting distribution. However, I can't get proper output from the code. Whenever I run the code below, I receive NULL output (twice, because of the loop). According to the documentation, this is the output when the test statistic is not calculated. How do I get gofstat to calculate it and display the p-value (and the distribution parameters, if possible)?
library(fitdistrplus)
set.seed(1)
testData <- rlnorm(1000)   # note: rlnorm, not lnorm, to draw the sample
distlist <- c("norm", "unif")
# Loop through the list of candidate distributions
for (i in 1:length(distlist)) {
  x <- fitdist(testData, distlist[i])
  gofstat(x)
  print(x$chisqpvalue)   # this is what prints NULL
  plot(x)
}
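Going by the fitdistrplus documentation, gofstat() returns its own result object, and chisqpvalue is a component of that result, not of the fitdist object; the loop above asks the wrong object for the p-value. A minimal sketch of the adjusted loop (the chisqpvalue and estimate component names are taken from the package docs):
library(fitdistrplus)
set.seed(1)
testData <- rlnorm(1000)
distlist <- c("norm", "unif")
pvals <- numeric(length(distlist))
for (i in 1:length(distlist)) {
  x <- fitdist(testData, distlist[i])
  g <- gofstat(x)              # keep the gofstat result object
  pvals[i] <- g$chisqpvalue    # the chi-square p-value lives here
  print(x$estimate)            # fitted parameters of this distribution
}
distlist[which.max(pvals)]     # distribution with the largest p-value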

Related

qq plot in R to check normality of the distribution?

I have been reading a tutorial from https://www.datanovia.com/en/lessons/anova-in-r/ on how to perform an ANOVA test in R. However, my question is about checking the normality of the distribution in general.
There is an option to do a QQ plot with the ggqqplot function, but I do not know how to call it correctly. From what I can see in the tutorial on Datanovia, they use the residuals from the linear model:
# Build the linear model
model <- lm(weight ~ group, data = PlantGrowth)
# Create a QQ plot of residuals
ggqqplot(residuals(model))
Then I performed the same test this way:
ggqqplot(PlantGrowth, "weight")
I expected to see the same result; however, the results differ.
From the documentation of ggqqplot it is not clear to me which way of calling it is correct. Does someone have an explanation?
Thanks :D
You would just do ggqqplot(PlantGrowth$weight), since that variable needs to be a vector of numeric values. The function takes a vector of numeric values and gives you the QQ plot, e.g. ggqqplot(iris$Sepal.Length).
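A sketch of why the two calls in the question disagree, using only the objects from the question: residuals(model) is the weight values with each group's fitted mean subtracted, while the "weight" column pools the raw values across groups, so the two QQ plots are built from different numbers and need not look alike.
library(ggpubr)
# QQ plot of residuals: weight with each group's mean removed
model <- lm(weight ~ group, data = PlantGrowth)
ggqqplot(residuals(model))
# QQ plot of the raw column: weight values pooled across all groups
ggqqplot(PlantGrowth, "weight")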

LASSO analysis (glmnet package). Can I loop the analysis and the results extraction?

I'm using the glmnet package. I need to run several LASSO analyses for the calibration of a large number of variables (% reflectance for each wavelength throughout the spectrum) against one dependent variable. I have a couple of doubts about the procedure and the results that I would like to resolve. My provisional code is below.
First, I split my data into training (70% of n) and testing sets:
smp_size <- floor(0.70 * nrow(mydata))
set.seed(123)
train_ind <- sample(seq_len(nrow(mydata)), size = smp_size)
train <- mydata[train_ind, ]
test <- mydata[-train_ind, ]
Then I separate the target trait (y) and the independent variables (x) for each set as follows:
vars.train <- train[3:2153]
vars.test <- test[3:2153]
x.train <- data.matrix(vars.train)
x.test <- data.matrix(vars.test)
y.train <- train$X1
y.test <- test$X1
Afterwards, I run a cross-validated LASSO model on the training set, then extract and write out the non-zero coefficients for lambda.min. This is because one of my concerns here is to note which variables (wavebands of the reflectance spectrum) are selected by the model.
install.packages("glmnet")
library(glmnet)
cv.lasso.1 <- cv.glmnet(y = y.train, x = x.train, family = "gaussian",
                        nfolds = 5, standardize = TRUE, alpha = 1)
coef(cv.lasso.1, s = cv.lasso.1$lambda.min) # Using lambda.min
cv.lasso.1 # print the cross-validation object
install.packages("broom")
library(broom)
c <- tidy(coef(cv.lasso.1, s="lambda.min"))
write.csv(c, file = "results")
Finally, I use predict() to apply the object cv.lasso.1 (the model obtained previously) to the variables of the testing set (x.test) in order to get predictions, and I run the correlation between the predicted and the actual values of Y for the testing set.
predict.1.2 <- predict(cv.lasso.1, newx = x.test, type = "response",
                       s = "lambda.min")
cor.test(x = c(predict.1.2), y = c(y.test))
This is simplified code and has caused no problems so far. The point is that I would like to make a loop (of one hundred repetitions) of the whole procedure and, for each repetition, get the non-zero coefficients of the cross-validated model as well as the correlation coefficient of the predicted vs. actual values for the testing set. I've tried but couldn't get any clear results. Can someone give me a hint?
Thanks!
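A minimal sketch of one way to structure that loop, reusing the question's own setup (assumptions: mydata exists, the response is in column X1, and the predictors sit in columns 3:2153, as above). Each repetition redraws the train/test split, refits cv.glmnet, and stores the selected variables and the test-set correlation:
library(glmnet)
n_rep <- 100
selected <- vector("list", n_rep)   # non-zero coefficient names per repetition
test_cor <- numeric(n_rep)          # predicted-vs-actual correlation per repetition
for (r in 1:n_rep) {
  # new random 70/30 split each repetition
  train_ind <- sample(seq_len(nrow(mydata)), size = floor(0.70 * nrow(mydata)))
  x.train <- data.matrix(mydata[train_ind, 3:2153])
  x.test  <- data.matrix(mydata[-train_ind, 3:2153])
  y.train <- mydata$X1[train_ind]
  y.test  <- mydata$X1[-train_ind]
  cv.fit <- cv.glmnet(y = y.train, x = x.train, family = "gaussian",
                      nfolds = 5, standardize = TRUE, alpha = 1)
  cf <- coef(cv.fit, s = "lambda.min")
  selected[[r]] <- rownames(cf)[as.vector(cf != 0)]  # includes "(Intercept)"
  pred <- predict(cv.fit, newx = x.test, s = "lambda.min")
  test_cor[r] <- cor(as.vector(pred), y.test)
}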
In general, running repeated analyses of the same type over and over on the same data can be tricky, and in your case it may not be necessary in the way you have outlined.
If you are trying to find the most predictive variables, you can use PCA (Principal Component Analysis) to select the variables with the most variation within a variable AND between variables. PCA does not consider your outcome at all, though, so with a poor model design it will pick the least correlated data in your repository, which may not be predictive; you should be very aware of all the variables in the set. This is a way of reducing the dimensionality of your data for a linear or logistic regression of some sort.
You can read about it in the documentation for prcomp:
yourPCA <- prcomp(yourData,
                  center = TRUE,
                  scale. = TRUE)
Scaling and centering are essential to making these models work right: they put your variables on a common footing by setting the means to 0 and the standard deviations to 1. Unless you know what you are doing, I would leave those arguments as they are. If you have skewed or kurtotic data, you might need to address that prior to PCA. Run this ONLY on your predictors; keep your target/outcome variable out of the data set.
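A hypothetical usage sketch of that advice (predictorCols is an assumed placeholder for the predictor columns; nothing else is new):
# PCA on predictors only, then inspect how much variance each component explains
yourPCA <- prcomp(yourData[, predictorCols], center = TRUE, scale. = TRUE)
summary(yourPCA)        # proportion of variance explained per component
head(yourPCA$x[, 1:5])  # sample scores on the first five components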
If you have a classification problem you are looking to resolve with a lot of data, try LDA (Linear Discriminant Analysis), which looks to reduce variables by optimizing the variance of each predictor with respect to the OUTCOME variable; it specifically considers your outcome.
require(MASS)
yourLDA <- lda(formula = outcome ~ .,
               data = yourdata)
You can also set the prior probabilities in LDA if you know the global probability for each class, or you can leave the argument out and lda will estimate the class probabilities from the training set. You can read about that in the lda documentation in the MASS package.
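A hypothetical sketch of that prior argument (the class proportions here are made up for illustration):
# Two-class LDA with known global class probabilities supplied explicitly
yourLDA <- lda(outcome ~ ., data = yourdata, prior = c(0.3, 0.7))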
So this gets you headed in the right direction for reducing the complexity of your data via feature selection in a computationally solid method. As for building the most robust model via repeated model building, that is known as cross-validation, and the cv.glm function in the boot package can help you get this taken care of in a safe way.
You can use the following as a rough guide:
require(boot)
yourGLM <- glm(outcome ~ ., data = yourdata, family = "gaussian")
yourCVGLM <- cv.glm(data = yourdata, glmfit = yourGLM, K = 100)
Here K = 100 specifies 100-fold cross-validation, i.e. 100 models built from random samples of your data OBSERVATIONS, not your variables. (Note that the K argument belongs to boot's cv.glm; glmnet's cv.glmnet uses nfolds instead.)
So the process is twofold: reduce variables using one of the two methods above, then use cross-validation to build a single model from repeated trials, without cumbersome loops!
You can read about cv.glm in the boot package documentation. Try starting on page 41, but look over the whole thing. The repeated sampling you are after is called bootstrapping; it is powerful and available in many different model types.
Not as much code as you might hope for, but this should point you in a decent direction.

R: use forecast::accuracy() on split data

Having a hard time getting the accuracy() function from {forecast} to work on predicted test values.
First, build the LM model on the training data (here for reproducibility):
library(ISLR)
set.seed(1)
train <- sample(392, 196)
lm.fit <- lm(mpg~horsepower, data = Auto, subset = train)
Then compute the MSE of the test data:
mean((Auto$mpg - predict(lm.fit, Auto))[-train]^2)
My goal is to use forecast::accuracy() to get the MSE (rather than the above) plus additional measures of error. However, I simply cannot get it to run, no matter what I feed it. This is definitely user error, and I'm looking for any thoughts out there.
I know forecast::accuracy() does not contain MSE "out of the box" but I plan on computing it via accuracy(data)[, 2]^2 and merging with the other output.
# forecast() has a method for lm models; accuracy() then compares the
# predictions with the test-set actuals, and [, 2]^2 squares the RMSE column
accuracy(forecast(lm.fit, newdata = Auto[-train, ]), Auto$mpg[-train])[, 2]^2
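As a sanity check, a sketch using only the objects defined above: the squared test-set RMSE from accuracy() should agree with the hand-computed test MSE from the question.
library(forecast)
# these two numbers should match (test-set MSE)
accuracy(forecast(lm.fit, newdata = Auto[-train, ]), Auto$mpg[-train])["Test set", "RMSE"]^2
mean((Auto$mpg - predict(lm.fit, Auto))[-train]^2)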

Leave one out cross validation with lm function in R

I have a dataset of 506 rows on which I am performing leave-one-out cross-validation. Once I get the mean squared errors, I compute the mean of those mean squared errors, and it changes every time I run it. Is this expected? If so, can someone please explain why it changes on every run?
To do leave-one-out CV, I shuffle the rows first (df is the data frame):
df <-df[sample.int(nrow(df)),]
Then I split the data frame into 506 data frames, send each to lm(), and get the MSE for each data frame (in this case, each row):
fit <- lm(train[,lastcolumn] ~.,data = train)
pred <- predict(fit,test)
pred <- mean((pred - test[,lastcolumn])^2)
And then I take the mean of all the MSEs I got.
Every time I run all this, I get a different mean. Is this expected?
Leave-one-out cross-validation is a validation paradigm. You have to state which algorithm you are using for your predictions, and you have to check whether there is some random initialization of the parameters in that prediction algorithm. If the initialization changes randomly, that could explain a different result every time the underlying algorithm is run. A Gaussian mixture model used for classification, with different initializations for the means and covariances, would be an example of an algorithm whose performance is not necessarily the same in each LOOCV run: Gaussian mixture models and k-means typically randomize the selection of the data points used to initialize a mean, and the number of Gaussians in the mixture can also change across initializations if an information-theoretic criterion is used to estimate it.
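For the lm() case in the question specifically, here is a self-contained sketch of the loop (assuming, as in the question, that df is the data frame and its last column is the response). Plain lm() has no random initialization, and a full leave-one-out pass holds out every row exactly once, so the initial shuffle should not change the resulting mean; if it does, the discrepancy is worth hunting down in the splitting code.
# give the response column an explicit placeholder name ("y") for the formula
names(df)[ncol(df)] <- "y"
mses <- numeric(nrow(df))
for (i in seq_len(nrow(df))) {
  fit <- lm(y ~ ., data = df[-i, ])            # fit on all rows except row i
  pred <- predict(fit, df[i, , drop = FALSE])  # predict the held-out row
  mses[i] <- (pred - df$y[i])^2                # squared error for row i
}
mean(mses)  # identical across runs, regardless of row order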

R prediction interval for the mean of the new sample

Given a regression model created from one dataset, I have been using WinBUGS to construct prediction intervals (PIs) around the mean of a second dataset. I have just discovered the "predict" function in R, but it delivers PIs around each predicted value in the second dataset. I have searched the R help, here and on the Net and only found the intervals for the separate members.
The average of these intervals is clearly not the same as the PI around the predicted sample mean (and I have tested that against the value I got from WinBUGS).
How do I get R to give me the PI around the mean?
There used to be an R mean.data.frame function, but it was deprecated and then removed. You can get the same result with:
mean.vec <- lapply(na.omit(dfrm), mean)  # column means of the second dataset
Then probably just:
predict(fit, newdata = data.frame(mean.vec))
I say 'probably' because you provided no dataset to test this with, and provision of such is, in my opinion, your responsibility. I have no idea whether this replicates the JMP method or the WinBUGS method.
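A sketch of the interval piece, as an assumption layered on the answer above (fit and dfrm as before): predict.lm accepts an interval argument, though note that an interval at the covariate means covers a single new observation there, which is not identical to an interval for the mean of a whole new sample.
mean.row <- data.frame(lapply(na.omit(dfrm), mean))
predict(fit, newdata = mean.row, interval = "prediction", level = 0.95)  # PI at the covariate means
predict(fit, newdata = mean.row, interval = "confidence", level = 0.95)  # CI for the mean response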
