Chi-squared goodness of fit test in R - r

I have a vector of observed values and also a vector of values calculated with a model:
actual <- c(1411,439,214,100,62,38,29,64)
expected <- c(1425.3,399.5,201.6,116.9,72.2,46.3,30.4,64.8)
Now I'm using the Chi-squared goodness of fit test to see how well my model performs.
I wrote the following:
chisq.test(expected,actual)
but it doesn't work. Can you help me with this?

X^2 = 10.2 at 7 degrees of freedom will give you a p ~ 0.18 .
> 1-pchisq(10.2, df = 7)
[1] 0.1775201
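You should pass the expected values via the argument p. Make sure you scale your values to sum to 1. With two vectors, chisq.test(x, y) cross-tabulates them as factors into a contingency table, which is not what you want here.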
> chisq.test(actual, p = expected/sum(expected))
Chi-squared test for given probabilities
data: actual
X-squared = 10.2581, df = 7, p-value = 0.1744
This is about what the X^2 test is doing. You give the function a model (expected) and ask: how likely is it that my observed data came from a population that "generated" expected?
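To see what chisq.test is computing here, the statistic can be reproduced by hand. This is a minimal sketch using the vectors from the question; the names e, x2, df and p are only for illustration.

```r
actual   <- c(1411, 439, 214, 100, 62, 38, 29, 64)
expected <- c(1425.3, 399.5, 201.6, 116.9, 72.2, 46.3, 30.4, 64.8)

# Expected counts rescaled to the observed total
e  <- sum(actual) * expected / sum(expected)
x2 <- sum((actual - e)^2 / e)                  # Pearson X^2 statistic
df <- length(actual) - 1                       # k - 1 categories
p  <- pchisq(x2, df, lower.tail = FALSE)       # upper-tail p-value

# Same numbers as the built-in test
res <- chisq.test(actual, p = expected / sum(expected))
all.equal(x2, unname(res$statistic))
all.equal(p, res$p.value)
```

Using pchisq(..., lower.tail = FALSE) is equivalent to 1 - pchisq(...), but is numerically safer for very small p-values.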

Related

Test for Poisson residuals in the analysis of variance model

I am trying to find a way to test for Poisson residuals, as one does for normal residuals, in aov(). In my hypothetical example:
# For normal distribution
x <- rep(seq(from=10, to=50, by=0.5),6)
y1 <- rnorm(length(x), mean=10, sd=1.5)
#Normality test in aov residuals
y1.av<-aov(y1 ~ x)
shapiro.test(y1.av$res)
# Shapiro-Wilk normality test
#
#data: y1.av$res
#W = 0.99782, p-value = 0.7885
Sounds silly, OK!!
Now I'd like to take the same approach, but for the Poisson distribution:
# For Poisson distribution
x <- rep(seq(from=10, to=50, by=0.5),6)
y2 <- rpois(x, lambda=10)
#Normality test in aov residuals
y2.av<-aov(y2 ~ x)
poisson.test(y2.av$res)
Error in poisson.test(y2.av$res) :
'x' must be finite, nonnegative, and integer
Is there any statistical approach to do this?
Thanks!
You could analyse your data in a counting context. Discrete data, such as Poisson variables, can be analysed based on observed frequencies, and you can formulate a hypothesis test for this task. With your data y, you can test the null hypothesis that y follows a Poisson distribution with some parameter lambda against the alternative hypothesis that y does not come from a Poisson distribution. Let's sketch the test with your data:
#Data
set.seed(123)
# For Poisson distribution
x <- rep(seq(from=10, to=50, by=0.5),6)
y2 <- rpois(x, lambda=10)
Now we obtain the counts, which are essential for the test:
#Values
df <- as.data.frame(table(y2),stringsAsFactors = F)
df$y2 <- as.integer(df$y2)
After that we must separate the observed frequencies O from their groups (classes). Together, these two elements describe the y variable:
#Observed values
O <- df$Freq
#Groups
classes <- df$y2
As we are testing a Poisson distribution, we must estimate the lambda parameter. This can be done by Maximum Likelihood Estimation (MLE). The MLE for the Poisson parameter is the sample mean (here computed from the counts and groups), so we compute it with the following code:
#MLE
meanval <- sum(O*classes)/sum(O)
Now, we have to get the probabilities of each class:
#Probs
prob <- dpois(classes,meanval)
The Poisson distribution has unbounded support, so the probabilities of the observed classes do not sum to one. We therefore compute the remaining probability mass (values greater than our last group, plus any unobserved values below or between the classes):
prhs <- 1-sum(prob)
This probability can simply be added to the last group, so that it accounts for values greater than or equal to it (for example, instead of only having the probability that y equals 20, we have the probability that y is greater than or equal to 20):
#Add probability
prob[length(prob)]<-prob[length(prob)]+prhs
With this we can conduct a goodness of fit test using the chisq.test() function in R. It requires the observed values O and the probabilities prob that we have computed. Note that chisq.test() reports k-1 degrees of freedom, which does not account for the estimated parameter. We can correct this with the version of the test that uses k-q-1 degrees of freedom, where k is the number of groups and q is the number of estimated parameters (we have estimated one parameter by MLE). Here is the test:
chisq.test(O,p=prob)
The output:
Chi-squared test for given probabilities
data: O
X-squared = 7.6692, df = 17, p-value = 0.9731
The key value from the test is X-squared, the test statistic. We can reuse it to obtain the corrected p-value (in our example k = 18 and q = 1, so the degrees of freedom are k - q - 1 = 16).
The p-value can be obtained with the following code:
p.value <- 1-pchisq(7.6692, 16)
The output:
[1] 0.9581098
As this p-value is greater than the usual significance levels, we do not reject the null hypothesis: the data are consistent with y coming from a Poisson distribution.
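The steps above can be collected into one function. This is only a sketch under the same assumptions as the answer (one estimated parameter, leftover probability folded into the last class); poisson_gof is a hypothetical helper name, not a function from any package.

```r
# Sketch: chi-squared goodness of fit for a Poisson sample with corrected df.
# poisson_gof() is a hypothetical helper, not part of base R or any package.
poisson_gof <- function(y) {
  tab     <- table(y)
  O       <- as.vector(tab)               # observed frequencies
  classes <- as.integer(names(tab))       # observed count values
  lambda  <- mean(y)                      # MLE of the Poisson parameter
  prob    <- dpois(classes, lambda)
  # Fold the remaining probability mass into the last class
  prob[length(prob)] <- prob[length(prob)] + (1 - sum(prob))
  E    <- sum(O) * prob                   # expected frequencies
  stat <- sum((O - E)^2 / E)              # Pearson X^2 statistic
  df   <- length(O) - 1 - 1               # k - q - 1 with q = 1
  list(statistic = stat, df = df,
       p.value = pchisq(stat, df, lower.tail = FALSE))
}

set.seed(123)
x  <- rep(seq(from = 10, to = 50, by = 0.5), 6)
y2 <- rpois(length(x), lambda = 10)
res <- poisson_gof(y2)
res
```

Classes with very small expected frequencies can make the chi-squared approximation unreliable; merging sparse classes before testing is a common remedy, but it is not done here.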

point biserial and p-value

I am trying to get a point biserial correlation between a continuous vocabulary score and syntactic productivity (dichotomous: productive vs not_productive).
I tried both the ltm package
> biserial.cor (lol$voc1_tvl, lol$synt, use = c("complete.obs"))
and the polycor package
> polyserial( lol$voc1_tvl, lol$synt, ML = FALSE, control = list(), std.err = FALSE, maxcor=.9999, bins=4)
The problem is that neither test gives me a p-value.
How could I run a point biserial correlation test and get the associated p-value or alternatively calculate the p-value myself?
Since the point biserial correlation is just a particular case of the popular Pearson's product-moment coefficient, you can use cor.test to approximate (more on that later) the correlation between a continuous X and a dichotomous Y. For example, given the following data:
set.seed(23049)
x <- rnorm(1e3)
y <- sample(0:1, 1e3, replace = TRUE)
Running cor.test(x, y) will give you the information you want.
Pearson's product-moment correlation
data: x and y
t = -1.1971, df = 998, p-value = 0.2316
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.09962497 0.02418410
sample estimates:
cor
-0.03786575
As an indication of the similarity between the coefficients, notice how the calculated correlation of -0.03786575 is similar to what ltm::biserial.cor gives you:
> library(ltm)
> biserial.cor(x, y, level = 2)
[1] -0.03784681
The difference lies in the fact that biserial.cor is calculated on the population, with standard deviations being divided by n, whereas cor and cor.test calculate standard deviations for a sample, dividing by n - 1.
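As a cross-check on the p-value, the test that cor.test performs for a dichotomous Y is equivalent to a two-sample t-test with pooled variance. The sketch below reuses the simulated data from above; the names ct and tt are only for illustration.

```r
set.seed(23049)
x <- rnorm(1e3)
y <- sample(0:1, 1e3, replace = TRUE)

ct <- cor.test(x, y)                      # Pearson test on the 0/1 coding
tt <- t.test(x ~ y, var.equal = TRUE)     # pooled-variance two-sample t-test

# Same p-value, and the same t statistic up to sign
all.equal(ct$p.value, tt$p.value)
all.equal(abs(unname(ct$statistic)), abs(unname(tt$statistic)))
```

The sign differs only because t.test reports the group 0 mean minus the group 1 mean, while the correlation is positive when group 1 has the larger mean.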
As cgage noted, you can also use the polyserial() function, which in my example would yield
> polyserial(x, y, std.err = TRUE)
Polyserial Correlation, 2-step est. = -0.04748 (0.03956)
Test of bivariate normality: Chisquare = 1.891, df = 5, p = 0.864
Here, I believe the difference in the calculated correlation (-0.04748) is due to polyserial using an optimization algorithm to approximate the calculation (which is unnecessary unless Y has more than two levels).
Using the ggplot2 dataset mpg as a reproducible example:
library(ggplot2)
# Use class as dichotomous variable (must subset)
newData = subset(mpg, class == 'midsize' | class == 'compact')
# Now getting p-value
library(polycor) # polyserial() is provided by the polycor package
polyserial(newData$cty,newData$class, std.err = T)
You will see all the output you desire using std.err=T in polyserial

Finding critical values for the Pearson correlation coefficient

I'd like to use R to find the critical values for the Pearson correlation coefficient.
This has proved difficult to find in search engines since the standard variable for the Pearson correlation coefficient is itself r. In turn, I'm finding a lot of r critical value tables (rather than how to find this by using the statistical package R).
I'm looking for a function that will provide output like the following:
I'm comfortable finding the correlation with:
cor(x,y)
However, I'd also like to find the critical values.
Is there a function I can use to enter n (or degrees of freedom) as well as alpha in order to find the critical value?
The significance of a correlation coefficient, r, is determined by converting r to a t-statistic and then finding the significance of that t-value at the degrees of freedom that correspond to the sample size, n. So, you can use R to find the critical t-value and then convert that value back to a correlation coefficient to find the critical correlation coefficient.
critical.r <- function(n, alpha = .05) {
  df <- n - 2
  critical.t <- qt(alpha/2, df, lower.tail = FALSE)
  critical.r <- sqrt((critical.t^2) / ((critical.t^2) + df))
  return(critical.r)
}
# Example usage: Critical correlation coefficient at sample size of n = 100
critical.r(100)
The general structure of hypothesis testing is kind of a mish-mash of two systems: Fisherian and Neyman-Pearson. Statisticians understand the differences but rarely does this get clearly presented in undergraduate stats classes. R was designed by and intended for statisticians as a toolbox, so they constructed a function named cor.test that will deliver a p-value (part of the Fisherian tradition) as well as a confidence interval for "r" (derived on the basis of the Neyman-Pearson formalism.) Fisher and Neyman had bitter disputes in their lifetime. The "critical value" terminology is part of the N-P testing strategy. It is equivalent to building a confidence interval and finding the particular statistic that reaches exactly a threshold value of 0.05 significance.
The code for constructing the inferential statistics in cor.test is available with:
methods(cor.test)
getAnywhere(cor.test.default)
# scroll down
method <- "Pearson's product-moment correlation"
#-----partial code----
r <- cor(x, y)
df <- n - 2L
ESTIMATE <- c(cor = r)
PARAMETER <- c(df = df)
STATISTIC <- c(t = sqrt(df) * r/sqrt(1 - r^2))
p <- pt(STATISTIC, df)
# ---- omitted some set up and error checking ----
# this is the confidence interval section------
z <- atanh(r)
sigma <- 1/sqrt(n - 3)
cint <- switch(alternative,
    less = c(-Inf, z + sigma * qnorm(conf.level)),
    greater = c(z - sigma * qnorm(conf.level), Inf),
    two.sided = z + c(-1, 1) * sigma * qnorm((1 + conf.level)/2))
cint <- tanh(cint)
So now you know how R does it. Notice that there is no "critical value" mentioned. I suspect that your hope was to find some table where a tabulation of "r" and "df" was laid out displaying the minimum "r" that would reach a significance of 0.05 for a given 'df'. Such a table could be built but that's not how this particular toolbox is constructed. You should now have the tools to build it yourself.
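For what it's worth, such a table can be sketched in a few lines by inverting the t statistic shown above, t = sqrt(df) * r / sqrt(1 - r^2), for r. The sample sizes, the column names and the name crit_table are arbitrary choices for illustration.

```r
# Two-sided critical r for sample size n: solve t = sqrt(df) * r / sqrt(1 - r^2)
critical.r <- function(n, alpha = .05) {
  df <- n - 2
  critical.t <- qt(alpha / 2, df, lower.tail = FALSE)
  sqrt(critical.t^2 / (critical.t^2 + df))
}

n <- c(5, 10, 20, 30, 50, 100)
crit_table <- data.frame(
  n    = n,
  df   = n - 2,
  r_05 = critical.r(n, alpha = .05),   # two-sided, alpha = 0.05
  r_01 = critical.r(n, alpha = .01)    # two-sided, alpha = 0.01
)
crit_table
```

Any |r| larger than the tabulated value for its row would give a two-sided p-value below the corresponding alpha in cor.test.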
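I would do the same. But if you are using a Spearman correlation, you need to convert t into r using a different formula.
Just replace the last line before the return in the function with this one:
critical.r <- sqrt(((critical.t^2) / (df)) + 1)
Note, however, that this expression is always at least 1, so it cannot be a valid correlation coefficient. When the exact test is not used, cor.test(method = "spearman") relies on the same t approximation as for Pearson, so the original conversion still applies for larger samples.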

Chi square goodness of fit for random numbers generated

I have used Inverse CDF method to generate 1000 samples from an exponential and a Cauchy random variable.
Now to verify whether these belong to their relevant distributions, I have to perform Chi-Squared Test for Goodness of fit.
I have tried two approaches (as below) -
chisq.test(y) #which has 1000 samples from supposed exponential distribution
chisq.test(z) #cauchy
I am getting the following error:
data: y
X-squared = 234.0518, df = 999, p-value = 1
Warning message:
In chisq.test(y) : Chi-squared approximation may be incorrect
chisq.test(z)
Error in chisq.test(z) :
all entries of 'x' must be nonnegative and finite
I downloaded the vcd library to use goodfit()
and typed:
t1 <- goodfit(y,type= "exponential",method= "MinChiSq")
summary(t1)
In this case, the error message:
Error: could not find function "goodfit"
Can somebody please guide me on how to implement the Chi-Squared GOF test properly?
Note: The samples are not from normal distribution (exponential and cauchy respectively)
I am trying to understand if it is possible to get the observed and expected data instead with no luck so far.
edit - I did type in library(vcd) before writing the rest of the code. Apologies to have assumed it was obvious.
The chisq.test(...) function is designed primarily for use with counts, so it expects its arguments to be either countable (using table(...) for example), or to be counts already. Given a single vector, it treats the values as counts and tests them against equal probabilities. Given two vectors x and y (the first two arguments), it creates a contingency table and uses the chi-squared test to assess whether they are independent.
You are probably better off using the Kolmogorov–Smirnov test, which is designed for problems like yours. The K-S test compares the ecdf of the sample to the cdf of the test distribution and tests the null hypothesis that they are the same.
set.seed(1)
df <- data.frame(y = rexp(1000),
z = rcauchy(1000, 100, 100))
ks.test(df$y,"pexp")
# One-sample Kolmogorov-Smirnov test
#
# data: df$y
# D = 0.0387, p-value = 0.1001
# alternative hypothesis: two-sided
ks.test(df$z,"pcauchy",100,100)
# One-sample Kolmogorov-Smirnov test
#
# data: df$z
# D = 0.0296, p-value = 0.3455
# alternative hypothesis: two-sided
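Note that in this case the K-S test gives a p-value of only 0.10 for df$y, even though the sample clearly did come from an exponential distribution. A discrepancy at least this large arises about 10% of the time for a genuinely exponential sample, so the null hypothesis is not rejected at the 5% level, but not by a wide margin.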
You can use chisq.test(...) by artificially binning your data and then comparing the counts in each bin to what would be expected from your test distribution (using p=...), but this is convoluted and the answer you get depends on the number of bins.
breaks <- c(seq(0,10,by=1))
O <- table(cut(df$y,breaks=breaks))
p <- diff(pexp(breaks))
chisq.test(O,p=p, rescale.p=T)
# Chi-squared test for given probabilities
#
# data: O
# X-squared = 7.9911, df = 9, p-value = 0.535
In this case the chi-squared test gives a p-value of 0.535, so it provides no evidence against the exponential distribution.
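One way to make the binning less arbitrary is to choose bins with equal probability under the test distribution, so that every bin has the same expected count. This is a sketch of that variation, not what the code above does; k = 10 bins and the names breaks_eq, O_eq and res_eq are assumptions for illustration.

```r
set.seed(1)
y <- rexp(1000)

k <- 10                                             # number of bins (arbitrary)
breaks_eq <- qexp(seq(0, 1, length.out = k + 1))    # quantiles from 0 to Inf
O_eq <- table(cut(y, breaks = breaks_eq, include.lowest = TRUE))

# Each bin has probability 1/k under the exponential(1) null
res_eq <- chisq.test(O_eq, p = rep(1 / k, k))
res_eq
```

Because the last break is Inf, no observations are dropped, and rescale.p is not needed. The result still depends on k, so it is worth trying a few values.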
Finally, even though they are qualitative, I find Q-Q plots to be very useful. These plot quantiles of your sample against quantiles of the test distribution. If the sample is drawn from the test distribution, the Q-Q plot should fall close to the line y=x.
par(mfrow=c(1,2))
plot(qexp(seq(0,1,0.01)),quantile(df$y,seq(0,1,0.01)),
main="Q-Q Plot",ylab="df$Y", xlab="Exponential",
xlim=c(0,5),ylim=c(0,5))
plot(qcauchy(seq(0,.99,0.01),100,100),quantile(df$z,seq(0,.99,0.01)),
main="Q-Q Plot",ylab="df$Z",xlab="Cauchy",
xlim=c(-1000,1000),ylim=c(-1000,1000))
Looking at the Q-Q plots gives me much more confidence in asserting that df$y and df$z are drawn, respectively, from the Exponential and Cauchy distributions than either the K-S or ChiSq tests, even though I can't put a number on it.
# Simulation
set.seed(123)
df <- data.frame(y = rexp(1000),
z = rcauchy(1000, 100, 100)
)
#This seems to be different, probably because of how you are simulating the data
chisq.test(df$y)
# Chi-squared test for given probabilities
#
# data: df$y
# X-squared = 978.485, df = 999, p-value = 0.6726
#
# Warning message:
# In chisq.test(df$y) : Chi-squared approximation may be incorrect
3 details:
1) you need to load the package. library(vcd)
2) There is no "exponential" type of distribution in the goodfit function
3) the method is MinChisq, Not MinChiSq
library(vcd)
t1 <- goodfit(df$y, type= "binomial", method= "MinChisq")
summary(t1)
# Goodness-of-fit test for binomial distribution
#
# X^2 df P(> X^2)
# Pearson 31.00952 6 2.524337e-05
# Warning message:
# In summary.goodfit(t1) : Chi-squared approximation may be incorrect

How to do: Correlation with "blocks" (or - "repeated measures" ?!)?

I have the following setup to analyse:
We have about 150 subjects, and for each subject we performed a pair of tests (under different conditions) 18 times.
The 18 different conditions of the test are complementary, in such a way that if we were to average over the tests (for each subject), we would get no correlation between the tests (between subjects).
What we wish to know is the correlation (and p-value) between the tests, within subjects, but over all the subjects.
The way I have done this so far was to compute the correlation for each subject, and then look at the distribution of the resulting correlations to see if its mean is different from 0.
But I suspect there might be a better way for answering the same question (someone said to me something about "geographical correlation", but a shallow search didn't help).
p.s: I understand there might be a place here to do some sort of mixed model, but I would prefer to present a "correlation", and am not sure how to extract such an output from a mixed model.
Also, here is a short dummy code to give an idea of what I am talking about:
attach(longley)
N <- length(Unemployed)
block <- c(
rep( "a", N),
rep( "b", N),
rep( "c", N)
)
Unemployed.3 <- c(Unemployed + rnorm(1),
Unemployed + rnorm(1),
Unemployed + rnorm(1))
GNP.deflator.3 <- c(GNP.deflator + rnorm(1),
GNP.deflator + rnorm(1),
GNP.deflator + rnorm(1))
cor(Unemployed, GNP.deflator)
cor(Unemployed.3, GNP.deflator.3)
cor(Unemployed.3[block == "a"], GNP.deflator.3[block == "a"])
cor(Unemployed.3[block == "b"], GNP.deflator.3[block == "b"])
cor(Unemployed.3[block == "c"], GNP.deflator.3[block == "c"])
(I would like to somehow combine the last three correlations...)
Any ideas will be welcomed.
Best,
Tal
I agree with Tristan - you are looking for ICC. The only difference from standard implementations is that the two raters (tests) evaluate each subject repeatedly. There might be an implementation that allows that. In the meantime, here is another approach to get the correlation.
You can use "general linear models", which are generalizations of linear models that explicitly allow correlation between residuals. The code below implements this using the gls function of the nlme package. I am sure there are other ways as well. To use this function we have to first reshape the data into a "long" format. I also changed the variable names to x and y for simplicity. I also used +rnorm(N) instead of +rnorm(1) in your code, because that's what I think you meant.
library(reshape)
library(nlme)
dd <- data.frame(x=Unemployed.3, y=GNP.deflator.3, block=factor(block))
dd$occasion <- factor(rep(1:N, 3)) # variable denoting measurement occasions
dd2 <- melt(dd, id=c("block","occasion")) # reshape
# fit model with the values within a measurement occasion correlated
# and different variances allowed for the two variables
mod <- gls(value ~ variable + block, data=dd2,
cor=corSymm(form=~1|block/occasion),
weights=varIdent(form=~1|variable))
# extract correlation
mod$modelStruct$corStruct
In the modeling framework you can use a likelihood ratio test to get a p-value. nlme can also give you a confidence interval:
mod2 <- gls(value ~ variable + block, data=dd2,
weights=varIdent(form=~1|variable))
anova(mod, mod2) # likelihood-ratio test for corr=0
intervals(mod)$corStruct # confidence interval for the correlation
If I understand your question correctly, you are interested in computing the intraclass correlation between multiple tests. There is an implementation in the psy package, although I have not used it.
If you want to perform inference on the correlation estimate, you could bootstrap the subjects. Just make sure to keep the tests together for each sample.
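A bootstrap over subjects might look like the sketch below. The data are simulated purely for illustration (30 hypothetical subjects with 18 paired measurements each), and the statistic is taken to be the mean of the within-subject correlations, which is one possible choice rather than the ICC itself.

```r
# Hypothetical data: 30 subjects, 18 paired measurements (x, y) per subject
set.seed(42)
n_subj <- 30
n_rep  <- 18
dat <- do.call(rbind, lapply(seq_len(n_subj), function(s) {
  x <- rnorm(n_rep)
  data.frame(subject = s, x = x, y = 0.5 * x + rnorm(n_rep))
}))

# One correlation per subject, so each subject's pairs stay together
r_subj <- sapply(split(dat, dat$subject), function(g) cor(g$x, g$y))
estimate <- mean(r_subj)

# Resample whole subjects with replacement and recompute the mean correlation
boot <- replicate(2000, mean(sample(r_subj, replace = TRUE)))
ci <- quantile(boot, c(0.025, 0.975))   # percentile bootstrap interval
estimate
ci
```

Because the statistic is a mean over subjects, resampling the per-subject correlations is the same as resampling subjects and recomputing. An interval that excludes 0 suggests a non-zero average within-subject correlation.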
I'm no expert, but this looks to me like what you want. It's automated, short to code, gives the same correlations as your example above, and produces p-values.
> df = data.frame(block=block, Unemployed=Unemployed.3,
+ GNP.deflator=GNP.deflator.3)
> require(plyr)
Loading required package: plyr
> ddply(df, "block", function(x){
+ as.data.frame(
+ with(x,cor.test(Unemployed, GNP.deflator))[c("p.value","estimate")]
+ )})
block p.value estimate
1 a 0.01030636 0.6206334
2 b 0.01030636 0.6206334
3 c 0.01030636 0.6206334
To see all the details, do this:
> dlply(df, "block", function(x){with(x,cor.test(Unemployed, GNP.deflator))})
$a
Pearson's product-moment correlation
data: Unemployed and GNP.deflator
t = 2.9616, df = 14, p-value = 0.01031
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.1804410 0.8536976
sample estimates:
cor
0.6206334
$b
Pearson's product-moment correlation
data: Unemployed and GNP.deflator
t = 2.9616, df = 14, p-value = 0.01031
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.1804410 0.8536976
sample estimates:
cor
0.6206334
$c
Pearson's product-moment correlation
data: Unemployed and GNP.deflator
t = 2.9616, df = 14, p-value = 0.01031
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.1804410 0.8536976
sample estimates:
cor
0.6206334
attr(,"split_type")
[1] "data.frame"
attr(,"split_labels")
block
1 a
2 b
3 c
