Chi square goodness of fit for random numbers generated - r

I have used Inverse CDF method to generate 1000 samples from an exponential and a Cauchy random variable.
Now to verify whether these belong to their relevant distributions, I have to perform Chi-Squared Test for Goodness of fit.
I have tried two approaches (as below) -
chisq.test(y) #which has 1000 samples from supposed exponential distribution
chisq.test(z) #cauchy
I am getting the following error:
data: y
X-squared = 234.0518, df = 999, p-value = 1
Warning message:
In chisq.test(y) : Chi-squared approximation may be incorrect
chisq.test(z)
Error in chisq.test(z) :
all entries of 'x' must be nonnegative and finite
I downloaded the vcd library to use goodfit()
and typed:
t1 <- goodfit(y,type= "exponential",method= "MinChiSq")
summary(t1)
In this case, the error message:
Error: could not find function "goodfit"
can somebody please guide on how to implement the Chi-Squared GOF test properly?
Note: The samples are not from normal distribution (exponential and cauchy respectively)
I am trying to understand if it is possible to get the observed and expected data instead with no luck so far.
edit - I did type in library(vcd) before writing the rest of the code. Apologies to have assumed it was obvious.

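The chisq.test(...) function is designed for use with counts, so it expects its arguments to be either countable (using table(...) for example), or to be counts already. Given two vectors x and y (the first two arguments), it builds a contingency table and tests whether they are independent. Given a single numeric vector, as in chisq.test(y), it treats every element as the count for its own category and tests those counts against equal probabilities. That is why you get df = 999, and why the negative values in the Cauchy sample trigger the error.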
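As a rough illustration of the single-vector case (a sketch using a simulated y as a stand-in, not the asker's data), the statistic that chisq.test(y) reports can be reproduced by hand by treating each value as a count whose expected value is mean(y):

```r
set.seed(1)
y <- rexp(1000)  # stand-in for the asker's sample (assumption)

# chisq.test() on a bare vector: every element is one "count"
res <- suppressWarnings(chisq.test(y))

# Same statistic by hand: expected count in every cell is mean(y)
manual <- sum((y - mean(y))^2 / mean(y))

c(reported = unname(res$statistic),
  manual   = manual,
  df       = unname(res$parameter))
```

The two statistics agree and df is length(y) - 1, which shows that the call never compares the sample with an exponential distribution at all.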
You are probably better off using the Kolmogorov–Smirnov test, which is designed for problems like yours. The K-S test compares the ecdf of the sample to the cdf of the test distribution and tests the null hypothesis that they are the same.
set.seed(1)
df <- data.frame(y = rexp(1000),
z = rcauchy(1000, 100, 100))
ks.test(df$y,"pexp")
# One-sample Kolmogorov-Smirnov test
#
# data: df$y
# D = 0.0387, p-value = 0.1001
# alternative hypothesis: two-sided
ks.test(df$z,"pcauchy",100,100)
# One-sample Kolmogorov-Smirnov test
#
# data: df$z
# D = 0.0296, p-value = 0.3455
# alternative hypothesis: two-sided
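Note that the p-value of 0.10 for df$y is fairly small even though the sample clearly did come from an exponential distribution: when the null hypothesis is true, a D at least this large still turns up in about 10% of samples. The p-value is not the probability that the sample did (or did not) come from the test distribution; at the usual 5% level, neither test rejects.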
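To get a feel for how much K-S p-values vary when the null hypothesis is true, one can repeat the test on many fresh exponential samples. This is only a sketch; the number of replications (500), the sample size (200) and the seed are arbitrary choices of mine:

```r
set.seed(42)

# 500 samples that really are exponential, each tested against pexp
p_values <- replicate(500, ks.test(rexp(200), "pexp")$p.value)

# Share of (false) rejections at the 5% level; should be near 0.05
reject_rate <- mean(p_values < 0.05)
reject_rate

# Under the null the p-values are spread roughly evenly over [0, 1]
summary(p_values)
```

So a single p-value around 0.1 from a correctly specified sample is unremarkable.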
You can use chisq.test(...) by artificially binning your data and then comparing the counts in each bin to what would be expected from your test distribution (using p=...), but this is convoluted and the answer you get depends on the number of bins.
breaks <- c(seq(0,10,by=1))
O <- table(cut(df$y,breaks=breaks))
p <- diff(pexp(breaks))
chisq.test(O,p=p, rescale.p=T)
# Chi-squared test for given probabilities
#
# data: O
# X-squared = 7.9911, df = 9, p-value = 0.535
In this case the chisq test gives a p-value of 0.535, so again there is no evidence against the hypothesis that your sample came from an exponential distribution.
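One way to make the binning less arbitrary is to use bins that are equally likely under the test distribution, with the last bin running to infinity so that no observation is dropped. A minimal sketch, assuming a fully specified Exp(rate = 1) null and k = 10 bins (both choices are mine, not part of the original answer):

```r
set.seed(1)
y <- rexp(1000)

k <- 10
# Bin edges at the 0%, 10%, ..., 100% quantiles of Exp(1): 0, ..., Inf
breaks <- qexp(seq(0, 1, length.out = k + 1))

# Observed counts per bin
O <- table(cut(y, breaks = breaks, include.lowest = TRUE))

# Each bin has probability 1/k under the null
res <- chisq.test(O, p = rep(1 / k, k))
res
```

Because the rate is fixed in advance rather than estimated from y, the k - 1 degrees of freedom reported by chisq.test() are appropriate here.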
Finally, even though they are qualitative, I find Q-Q plots to be very useful. These plot quantiles of your sample against quantiles of the test distribution. If the sample is drawn from the test distribution, the Q-Q plot should fall close to the line y=x.
par(mfrow=c(1,2))
plot(qexp(seq(0,1,0.01)),quantile(df$y,seq(0,1,0.01)),
main="Q-Q Plot",ylab="df$Y", xlab="Exponential",
xlim=c(0,5),ylim=c(0,5))
plot(qcauchy(seq(0,.99,0.01),100,100),quantile(df$z,seq(0,.99,0.01)),
main="Q-Q Plot",ylab="df$Z",xlab="Cauchy",
xlim=c(-1000,1000),ylim=c(-1000,1000))
Looking at the Q-Q plots gives me much more confidence in asserting that df$y and df$z are drawn, respectively, from the Exponential and Cauchy distributions than either the K-S or ChiSq tests, even though I can't put a number on it.
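The same checks apply to samples produced with the inverse CDF method mentioned in the question. A minimal sketch, assuming rate 1 for the exponential and a standard Cauchy (the question does not state the parameters actually used, and the names u1 and u2 are mine):

```r
set.seed(1)
u1 <- runif(1000)
u2 <- runif(1000)

# Inverse CDF of Exp(rate = 1): F^{-1}(u) = -log(1 - u)
y <- -log(1 - u1)

# Inverse CDF of the standard Cauchy: F^{-1}(u) = tan(pi * (u - 1/2))
z <- tan(pi * (u2 - 0.5))

# Compare each sample with its target distribution
ks_y <- ks.test(y, "pexp")
ks_z <- ks.test(z, "pcauchy")
c(exponential = ks_y$p.value, cauchy = ks_z$p.value)
```

These transforms are what qexp() and qcauchy() compute, so y and z should match qexp(u1) and qcauchy(u2) up to floating-point error.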

# Simulation
set.seed(123)
df <- data.frame(y = rexp(1000),
z = rcauchy(1000, 100, 100)
)
#This seems to be different, probably because of how you are simulating the data
chisq.test(df$y)
# Chi-squared test for given probabilities
#
# data: df$y
# X-squared = 978.485, df = 999, p-value = 0.6726
#
# Warning message:
# In chisq.test(df$y) : Chi-squared approximation may be incorrect
Three details:
1) You need to load the package: library(vcd)
2) There is no "exponential" type of distribution in the goodfit function (it only handles "poisson", "binomial" and "nbinomial")
3) The method is MinChisq, not MinChiSq
library(vcd)
t1 <- goodfit(df$y, type= "binomial", method= "MinChisq")
summary(t1)
# Goodness-of-fit test for binomial distribution
#
# X^2 df P(> X^2)
# Pearson 31.00952 6 2.524337e-05
# Warning message:
# In summary.goodfit(t1) : Chi-squared approximation may be incorrect

Related

Test for Poisson residuals in the analysis of variance model

I am trying to find a way to test aov() residuals for a Poisson distribution, in the same way one tests them for normality. In my hypothetical example:
# For normal distribution
x <- rep(seq(from=10, to=50, by=0.5),6)
y1 <- rnorm(length(x), mean=10, sd=1.5)
#Normality test in aov residuals
y1.av<-aov(y1 ~ x)
shapiro.test(y1.av$res)
# Shapiro-Wilk normality test
#
#data: y1.av$res
#W = 0.99782, p-value = 0.7885
Sounds silly, OK!!
Now I would like to take the same approach, but for the Poisson distribution:
# For Poisson distribution
x <- rep(seq(from=10, to=50, by=0.5),6)
y2 <- rpois(x, lambda=10)
#Normality test in aov residuals
y2.av<-aov(y2 ~ x)
poisson.test(y2.av$res)
Error in poisson.test(y2.av$res) :
'x' must be finite, nonnegative, and integer
Is there any statistical approach to do this?
Thanks!
You could analyse your data in a counting context. Discrete data, such as Poisson variables, can be analysed through their observed frequencies, and you can set up a hypothesis test for this task. With your data y, you test the null hypothesis that y follows a Poisson distribution with some parameter lambda against the alternative hypothesis that y does not come from a Poisson distribution. Let's sketch the test with your data:
#Data
set.seed(123)
# For Poisson distribution
x <- rep(seq(from=10, to=50, by=0.5),6)
y2 <- rpois(x, lambda=10)
Now we obtain the counts, which are essential for the test:
#Values
df <- as.data.frame(table(y2),stringsAsFactors = F)
df$y2 <- as.integer(df$y2)
After that we separate the observed frequencies O from the groups (classes) they belong to. Together, these two elements describe the y variable:
#Observed values
O <- df$Freq
#Groups
classes <- df$y2
As we are testing a Poisson distribution, we must estimate the lambda parameter. This can be done with Maximum Likelihood Estimation (MLE). The MLE for the Poisson distribution is the sample mean (computed here from the counts and groups), so we obtain it with the following code:
#MLE
meanval <- sum(O*classes)/sum(O)
Now, we have to get the probabilities of each class:
#Probs
prob <- dpois(classes,meanval)
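The Poisson distribution has unbounded support, so we must also account for the probability of values greater than our last group in order for the probabilities to sum to one (strictly, 1-sum(prob) also contains the small probability of any values below the first group or missing between groups):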
prhs <- 1-sum(prob)
This probability can simply be added to the last group so that it accounts for values greater than or equal to it (for example, instead of only having the probability that y equals 20, we have the probability that y is greater than or equal to 20):
#Add probability
prob[length(prob)]<-prob[length(prob)]+prhs
With this we can run a goodness-of-fit test using the chisq.test() function in R. It requires the observed values O and the probabilities prob that we have computed. Keep in mind that chisq.test() uses k-1 degrees of freedom and does not know that a parameter was estimated from the data, so we correct this by using k-q-1 degrees of freedom, where k is the number of groups and q is the number of estimated parameters (here one, the MLE of lambda). The test:
chisq.test(O,p=prob)
The output:
Chi-squared test for given probabilities
data: O
X-squared = 7.6692, df = 17, p-value = 0.9731
The key value from the test is X-squared, the test statistic. We can reuse it to obtain the corrected p-value (in our example k=18, so subtracting 2 gives 16 degrees of freedom).
The p-value can be obtained with the following code:
p.value <- 1-pchisq(7.6692, 16)
The output:
[1] 0.9581098
As this value is greater than the usual significance levels, we do not reject the null hypothesis: the data are consistent with y coming from a Poisson distribution.
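The steps above can be collected into one function. This is only a sketch: poisson_gof is a hypothetical helper name, and unlike the code above it uses every integer between min(y) and max(y) as a class and folds the lower and upper tail probabilities into the first and last class, so that the probabilities sum to one exactly.

```r
# Hypothetical helper: chi-squared goodness-of-fit test for a Poisson
# distribution whose lambda is estimated from the data
poisson_gof <- function(y) {
  lambda_hat <- mean(y)                         # MLE of lambda
  classes <- seq(min(y), max(y))                # contiguous classes
  O <- as.vector(table(factor(y, levels = classes)))

  prob <- dpois(classes, lambda_hat)
  # First class covers P(Y <= min), last class covers P(Y >= max)
  prob[1] <- ppois(min(y), lambda_hat)
  prob[length(prob)] <- ppois(max(y) - 1, lambda_hat, lower.tail = FALSE)

  E <- sum(O) * prob                            # expected counts
  stat <- sum((O - E)^2 / E)
  df <- length(classes) - 1 - 1                 # k - q - 1 with q = 1
  list(statistic = stat, df = df,
       p.value = pchisq(stat, df, lower.tail = FALSE))
}

set.seed(123)
y2 <- rpois(486, lambda = 10)
res <- poisson_gof(y2)
res
```

Classes with very small expected counts in the tails can make the chi-squared approximation unreliable; merging such classes before testing is a common remedy.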

How to do Kolmogorov-Smirnov statistic for GEV distribution in R?

I am using the extRemes package to fit a generalized extreme value (GEV) distribution, and I want to use the Kolmogorov-Smirnov test to assess the goodness of fit, but I get the following warning:
library(extRemes)
library(eva)
data("PORTw", package = "extRemes")
fit1 <- fevd(TMX1, PORTw, units = "deg C")
ks.test(PORTw$TMX1,"pgev",fit1$results$par[[1]],fit1$results$par[[2]],shape=fit1$results$par[[3]])
Warning message:
In ks.test(PORTw$TMX1, "pgev", fit1$results$par[[1]], fit1$results$par[[2]], :
ties should not be present for the Kolmogorov-Smirnov test
So, my question is, how to perform a Kolmogorov-Smirnov test for the GEV fit with ties? Or, is there any other goodness of fit test for fitting a distribution available in R? Thanks a lot.
I recommend the "EnvStats" package.
It offers more flexibility for goodness-of-fit tests:
library(EnvStats)
# For a data set called X
X <- rgevd(500)
# Generalized Extreme Value (EnvStats)
egevd(X, method = "mle")# Maximum likelihood
# Goodness of fit test
gofTest(X, distribution = "gev",test = "ks")#Kolmogorov-Smirnov
gofTest(X, distribution = "gev",test = "chisq")#Chi-Squared

point biserial and p-value

I am trying to get a point biserial correlation between a continuous vocabulary score and syntactic productivity (dichotomous: productive vs not_productive).
I tried both the ltm package
> biserial.cor (lol$voc1_tvl, lol$synt, use = c("complete.obs"))
and the polycor package
> polyserial( lol$voc1_tvl, lol$synt, ML = FALSE, control = list(), std.err = FALSE, maxcor=.9999, bins=4)
The problem is that neither test gives me a p-value
How could I run a point biserial correlation test and get the associated p-value or alternatively calculate the p-value myself?
Since the point biserial correlation is just a particular case of the popular Pearson's product-moment coefficient, you can use cor.test to approximate (more on that later) the correlation between a continuous X and a dichotomous Y. For example, given the following data:
set.seed(23049)
x <- rnorm(1e3)
y <- sample(0:1, 1e3, replace = TRUE)
Running cor.test(x, y) will give you the information you want.
Pearson's product-moment correlation
data: x and y
t = -1.1971, df = 998, p-value = 0.2316
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.09962497 0.02418410
sample estimates:
cor
-0.03786575
As an indication of the similarity between the coefficients, notice how the calculated correlation of -0.03786575 is similar to what ltm::biserial.cor gives you:
> library(ltm)
> biserial.cor(x, y, level = 2)
[1] -0.03784681
The difference lies in the fact that biserial.cor is calculated on the population, with standard deviations being divided by n, whereas cor and cor.test calculate standard deviations for a sample, dividing by n - 1.
As cgage noted, you can also use the polyserial() function, which in my example would yield
> polyserial(x, y, std.err = TRUE)
Polyserial Correlation, 2-step est. = -0.04748 (0.03956)
Test of bivariate normality: Chisquare = 1.891, df = 5, p = 0.864
Here, I believe the difference in the calculated correlation (-0.04748) is due to polyserial using an optimization algorithm to approximate the calculation (which is unnecessary unless Y has more than two levels).
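Because the point biserial coefficient is Pearson's r computed with a 0/1 variable, the p-value from cor.test coincides with that of a pooled-variance two-sample t-test on the two groups. A small base-R sketch on the same simulated data (the name r_manual is mine):

```r
set.seed(23049)
x <- rnorm(1e3)
y <- sample(0:1, 1e3, replace = TRUE)

# Pearson test and pooled-variance t-test give the same p-value
ct <- cor.test(x, y)
tt <- t.test(x ~ y, var.equal = TRUE)
c(cor_test = ct$p.value, t_test = tt$p.value)

# Point biserial formula with the "divide by n" standard deviation
p <- mean(y == 1)
sd_n <- sqrt(mean((x - mean(x))^2))
r_manual <- (mean(x[y == 1]) - mean(x[y == 0])) * sqrt(p * (1 - p)) / sd_n
c(manual = r_manual, cor = unname(ct$estimate))
```

The hand-computed coefficient matches the estimate returned by cor.test, so either route gives a usable p-value for the point biserial correlation.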
Using the ggplot2 dataset mpg as a reproducible example:
library(ggplot2)
# Use class as dichotomous variable (must subset)
newData = subset(mpg, class == 'midsize' | class == 'compact')
# Now getting p-value
library(ltm)
polyserial(newData$cty,newData$class, std.err = T)
You will see all the output you desire using std.err=T in polyserial

P-value for polyserial correlation

I have some basic questions concerning the polyserial() {polycor} function.
Does a p-value exist for rho, or can it be calculated?
For the assumption of a bivariate normal, is the tested null hypothesis "Yes, bivariate normal"? That is, do I want a high or low p-value?
Thanks.
If you form the returned object with:
polS <- polyserial(x, y, ML=TRUE, std.err=TRUE) # ML estimate
... You should have no difficulty forming a p-value for the hypothesis: rho == 0 using a z-statistic formed by the ratio of a parameter divided by its standard error. But that is not the same as testing the assumption of bivariate normality. For that you need to examine "chisq" component of polS. The print method for objects of class 'polycor' hands that to you in a nice little sentence. You interpret that result in the usual manner: Low p-values are stronger evidence against the null hypothesis (in this case H0: bivariate normality). As a scientist, you do not "want" either result. You want to understand what the data is telling you.
I e-mailed the package author (because I had the same questions) and based on his clarifications, I offer my answers:
First, the easy question: higher p-values (traditionally > 0.05) give you more confidence that the distribution is bivariate normal. Lower p-values indicate a non-normal distribution, BUT if the sample size is sufficiently large and you use the maximum likelihood estimate (option ML=TRUE), non-normality doesn't matter; the correlation is still reliable anyway.
Now, for the harder question: to calculate the p-value, you need to:
1) Execute polyserial with the std.err=TRUE option to have access to more details.
2) From the resulting polyserial object, access the var[1, 1] element. var is the covariance matrix of the parameter estimates, and sqrt(var[1, 1]) is the standard error (which displays in parentheses in the output after the rho result).
3) From the standard error, you can calculate the p-value based on the R code below.
Here's some code to illustrate this with copiable R-code, based on the example code in the polyserial documentation:
library(mvtnorm)
library(polycor)
set.seed(12345)
data <- rmvnorm(1000, c(0, 0), matrix(c(1, .5, .5, 1), 2, 2))
x <- data[,1]
y <- data[,2]
y <- cut(y, c(-Inf, -1, .5, 1.5, Inf))
# 2-step estimate
poly_2step <- polyserial(x, y, std.err=TRUE)
poly_2step
##
## Polyserial Correlation, 2-step est. = 0.5085 (0.02413)
## Test of bivariate normality: Chisquare = 8.604, df = 11, p = 0.6584
std.err_2step <- sqrt(poly_2step$var[1, 1])
std.err_2step
## [1] 0.02413489
p_value_2step <- 2 * pnorm(-abs(poly_2step$rho / std.err_2step))
p_value_2step
## [1] 1.529176e-98
# ML estimate
poly_ML <- polyserial(x, y, ML=TRUE, std.err=TRUE)
poly_ML
##
## Polyserial Correlation, ML est. = 0.5083 (0.02466)
## Test of bivariate normality: Chisquare = 8.548, df = 11, p = 0.6635
##
## 1 2 3
## Threshold -0.98560 0.4812 1.50700
## Std.Err. 0.04408 0.0379 0.05847
std.err_ML <- sqrt(poly_ML$var[1, 1])
std.err_ML
## [1] 0.02465517
p_value_ML <- 2 * pnorm(-abs(poly_ML$rho / std.err_ML))
p_value_ML
##
## 1.927146e-94
And to answer an important question that you didn't ask: you would want to always use the maximum likelihood version (ML=TRUE) because it is more accurate, except if you have a really slow computer, in which case the default 2-step approach is acceptable.

Chi-squared goodness of fit test in R

I have a vector of observed values and also a vector of values calculated with model:
actual <- c(1411,439,214,100,62,38,29,64)
expected <- c(1425.3,399.5,201.6,116.9,72.2,46.3,30.4,64.8)
Now I'm using the Chi-squared goodness of fit test to see how well my model performs.
I wrote the following:
chisq.test(expected,actual)
but it doesn't work. Can you help me with this?
X^2 = 10.2 at 7 degrees of freedom will give you a p ~ 0.18 .
> 1-pchisq(10.2, df = 7)
[1] 0.1775201
You should pass on the expected values under argument p. Make sure you scale your values to sum to 1.
> chisq.test(actual, p = expected/sum(expected))
Chi-squared test for given probabilities
data: actual
X-squared = 10.2581, df = 7, p-value = 0.1744
Think about what the X^2 test is doing. You give the function a model (expected) and ask: how likely is it that my observed data came from a population that "generated" expected?
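To see where the statistic comes from, it can be computed by hand as the sum of (observed - expected)^2 / expected and compared with what chisq.test() returns. A short sketch using the vectors from the question (x2_manual is just an illustrative name):

```r
actual   <- c(1411, 439, 214, 100, 62, 38, 29, 64)
expected <- c(1425.3, 399.5, 201.6, 116.9, 72.2, 46.3, 30.4, 64.8)

# Pearson statistic by hand
x2_manual <- sum((actual - expected)^2 / expected)

# p-value with k - 1 = 7 degrees of freedom
p_manual <- pchisq(x2_manual, df = length(actual) - 1, lower.tail = FALSE)

# Same test via chisq.test(), with expected rescaled to probabilities
res <- chisq.test(actual, p = expected / sum(expected))

c(manual = x2_manual, chisq_test = unname(res$statistic))
c(manual = p_manual, chisq_test = res$p.value)
```

The two agree here because expected already sums to the same total as actual (2357). If the model had parameters fitted to these same data, the degrees of freedom would need to be reduced accordingly.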

Resources