P-value for polyserial correlation - r

I have some basic questions concerning the polyserial() {polycor} function.
Does a p-value exist for rho, or can it be calculated?
For the assumption of a bivariate
normal, is the tested null hypothesis "Yes, bivariate normal"? That is, do I want a high or low p-value.
Thanks.

If you form the returned object with:
polS <- polyserial(x, y, ML=TRUE, std.err=TRUE) # ML estimate
... You should have no difficulty forming a p-value for the hypothesis: rho == 0 using a z-statistic formed by the ratio of a parameter divided by its standard error. But that is not the same as testing the assumption of bivariate normality. For that you need to examine "chisq" component of polS. The print method for objects of class 'polycor' hands that to you in a nice little sentence. You interpret that result in the usual manner: Low p-values are stronger evidence against the null hypothesis (in this case H0: bivariate normality). As a scientist, you do not "want" either result. You want to understand what the data is telling you.

I e-mailed the package author -because I had the same questions) and based on his clarifications, I offer my answers:
First, the easy question: higher p-values (traditionally > 0.05) give you more confidence that the distribution is bivariate normal. Lower p-values indicate a non-normal distribution, BUT, if the sample size is sufficiently large, the maximum likelihood estimate (option ML=TRUE), non-normality doesn't matter; the correlation is still reliable anyway.
Now, for the harder question: to calculate the p-value, you need to:
Execute polyserial with the std.err=TRUE option to have access to more details.
From the resulting polyserial object, access the var[1, 1] element. var is the covariance matrix of the parameter estimates, and sqrt(var[1, 1]) is the standard error (which displays in parentheses in the output after the rho result).
From the standard error, you can calculate the p-value based on the R code below.
Here's some code to illustrate this with copiable R-code, based on the example code in the polyserial documentation:
library(mvtnorm)
library(polycor)
set.seed(12345)
data <- rmvnorm(1000, c(0, 0), matrix(c(1, .5, .5, 1), 2, 2))
x <- data[,1]
y <- data[,2]
y <- cut(y, c(-Inf, -1, .5, 1.5, Inf))
# 2-step estimate
poly_2step <- polyserial(x, y, std.err=TRUE)
poly_2step
##
## Polyserial Correlation, 2-step est. = 0.5085 (0.02413)
## Test of bivariate normality: Chisquare = 8.604, df = 11, p = 0.6584
std.err_2step <- sqrt(poly_2step$var[1, 1])
std.err_2step
## [1] 0.02413489
p_value_2step <- 2 * pnorm(-abs(poly_2step$rho / std.err_2step))
p_value_2step
## [1] 1.529176e-98
# ML estimate
poly_ML <- polyserial(x, y, ML=TRUE, std.err=TRUE)
poly_ML
##
## Polyserial Correlation, ML est. = 0.5083 (0.02466)
## Test of bivariate normality: Chisquare = 8.548, df = 11, p = 0.6635
##
## 1 2 3
## Threshold -0.98560 0.4812 1.50700
## Std.Err. 0.04408 0.0379 0.05847
std.err_ML <- sqrt(poly_ML$var[1, 1])
std.err_ML
## [1] 0.02465517
p_value_ML <- 2 * pnorm(-abs(poly_ML$rho / std.err_ML))
p_value_ML
##
## 1.927146e-94
And to answer an important question that you didn't ask: you would want to always use the maximum likelihood version (ML=TRUE) because it is more accurate, except if you have a really slow computer, in which case the default 2-step approach is acceptable.

Related

mgcv: obtain predictive distribution of response given new data (negative binomial example)

In GAM (and GLM, for that matter), we're fitting a conditional likelihood model. So after fitting the model, for a new input x and response y, I should be able to compute the predictive probability or density of a specific value of y given x. I might want to do this to compare the fit of various models on validation data, for example. Is there a convenient way to do this with a fitted GAM in mgcv? Otherwise, how do I figure out the exact form of the density that is used so I can plug in the parameters appropriately?
As a specific example, consider a negative binomial GAM :
## From ?negbin
library(mgcv)
set.seed(3)
n<-400
dat <- gamSim(1,n=n)
g <- exp(dat$f/5)
## negative binomial data...
dat$y <- rnbinom(g,size=3,mu=g)
## fit with theta estimation...
b <- gam(y~s(x0)+s(x1)+s(x2)+s(x3),family=nb(),data=dat)
And now I want to compute the predictive probability of, say, y=7, given x=(.1,.2,.3,.4).
Yes. mgcv is doing (empirical) Bayesian estimation, so you can obtain predictive distribution. For your example, here is how.
# prediction on the link (with standard error)
l <- predict(b, newdata = data.frame(x0 = 0.1, x1 = 0.2, x2 = 0.3, x3 = 0.4), se.fit = TRUE)
# Under central limit theory in GLM theory, link value is normally distributed
# for negative binomial with `log` link, the response is log-normal
p.mu <- function (mu) dlnorm(mu, l[[1]], l[[2]])
# joint density of `y` and `mu`
p.y.mu <- function (y, mu) dnbinom(y, size = 3, mu = mu) * p.mu(mu)
# marginal probability (not density as negative binomial is discrete) of `y` (integrating out `mu`)
# I have carefully written this function so it can take vector input
p.y <- function (y) {
scalar.p.y <- function (scalar.y) integrate(p.y.mu, lower = 0, upper = Inf, y = scalar.y)[[1]]
sapply(y, scalar.p.y)
}
Now since you want probability of y = 7, conditional on specified new data, use
p.y(7)
# 0.07810065
In general, this approach by numerical integration is not easy. For example, if other link functions like sqrt() is used for negative binomial, the distribution of response is not that straightforward (though also not difficult to derive).
Now I offer a sampling based approach, or Monte Carlo approach. This is most similar to Bayesian procedure.
N <- 1000 # samples size
set.seed(0)
## draw N samples from posterior of `mu`
sample.mu <- b$family$linkinv(rnorm(N, l[[1]], l[[2]]))
## draw N samples from likelihood `Pr(y|mu)`
sample.y <- rnbinom(1000, size = 3, mu = sample.mu)
## Monte Carlo estimation for `Pr(y = 7)`
mean(sample.y == 7)
# 0.076
Remark 1
Note that as empirical Bayes, all above methods are conditional on estimated smoothing parameters. If you want something like a "full Bayes", set unconditional = TRUE in predict().
Remark 2
Perhaps some people are assuming the solution as simple as this:
mu <- predict(b, newdata = data.frame(x0 = 0.1, x1 = 0.2, x2 = 0.3, x3 = 0.4), type = "response")
dnbinom(7, size = 3, mu = mu)
Such result is conditional on regression coefficients (assumed fixed without uncertainty), thus mu becomes fixed and not random. This is not predictive distribution. Predictive distribution would integrate out uncertainty of model estimation.

point biserial and p-value

I am trying to get a point biserial correlation between a continuous vocabulary score and syntactic productivity (dichotomous: productive vs not_productive).
I tried both the ltm packages
> biserial.cor (lol$voc1_tvl, lol$synt, use = c("complete.obs"))
and the polycor package
> polyserial( lol$voc1_tvl, lol$synt, ML = FALSE, control = list(), std.err = FALSE, maxcor=.9999, bins=4)
The problem is that neither test gives me a p-value
How could I run a point biserial correlation test and get the associated p-value or alternatively calculate the p-value myself?
Since the point biserial correlation is just a particular case of the popular Peason's product-moment coefficient, you can use cor.test to approximate (more on that later) the correlation between a continuous X and a dichotomous Y. For example, given the following data:
set.seed(23049)
x <- rnorm(1e3)
y <- sample(0:1, 1e3, replace = TRUE)
Running cor.test(x, y) will give you the information you want.
Pearson's product-moment correlation
data: x and y
t = -1.1971, df = 998, p-value = 0.2316
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.09962497 0.02418410
sample estimates:
cor
-0.03786575
As an indication of the similarity between the coefficients, notice how the calculated correlation of -0.03786575 is similar to what ltm::biserial.cor gives you:
> library(ltm)
> biserial.cor(x, y, level = 2)
[1] -0.03784681
The diference lies on the fact that biserial.cor is calculated on the population, with standard deviations being divided by n, where cor and cor.test calculate standard deviations for a sample, dividing by n - 1.
As cgage noted, you can also use the polyserial() function, which in my example would yield
> polyserial(x, y, std.err = TRUE)
Polyserial Correlation, 2-step est. = -0.04748 (0.03956)
Test of bivariate normality: Chisquare = 1.891, df = 5, p = 0.864
Here, I believe the difference in the calculated correlation (-0.04748) is due to polyserial using an optimization algorithm to approximate the calculation (which is unnecessary unless Y has more than two levels).
Using the ggplot2 dataset mpg as a reproducible example:
library(ggplot2)
# Use class as dichotomous variable (must subset)
newData = subset(mpg, class == 'midsize' | class == 'compact')
# Now getting p-value
library(ltm)
polyserial(newData$cty,newData$class, std.err = T)
You will see all the output you desire using std.err=T in polyserial

Performing Anova on Bootstrapped Estimates from Quantile Regression

So I'm using the quantreg package in R to conduct quantile regression analyses to test how the effects of my predictors vary across the distribution of my outcome.
FML <- as.formula(outcome ~ VAR + c1 + c2 + c3)
quantiles <- c(0.25, 0.5, 0.75)
q.Result <- list()
for (i in quantiles){
i.no <- which(quantiles==i)
q.Result[[i.no]] <- rq(FML, tau=i, data, method="fn", na.action=na.omit)
}
Then i call anova.rq which runs a Wald test on all the models and outputs a pvalue for each covariate telling me whether the effects of each covariate vary significantly across the distribution of my outcome.
anova.Result <- anova(q.Result[[1]], q.Result[[2]], q.Result[[3]], joint=FALSE)
Thats works just fine. However, for my particular data (and in general?), bootstrapping my estimates and their error is preferable. Which i conduct with a slight modification of the code above.
q.Result <- rqs(FML, tau=quantiles, data, method="fn", na.action=na.omit)
q.Summary <- summary(Q.mod, se="boot", R=10000, bsmethod="mcmb",
covariance=TRUE)
Here's where i get stuck. The quantreg currently cannot peform the anova (Wald) test on boostrapped estimates. The information files on the quantreg packages specifically states that "extensions of the methods to be used in anova.rq should be made" regarding the boostrapping method.
Looking at the details of the anova.rq method. I can see that it requires 2 components not present in the quantile model when bootstrapping.
1) Hinv (Inverse Hessian Matrix). The package information files specifically states "note that for se = "boot" there is no way to split the estimated covariance matrix into its sandwich constituent parts."
2) J which, according to the information files, is "Unscaled Outer product of gradient matrix returned if cov=TRUE and se != "iid". The Huber sandwich is cov = tau (1-tau) Hinv %*% J %*% Hinv. as for the Hinv component, there is no J component when se == "boot". (Note that to make the Huber sandwich you need to add the tau (1-tau) mayonnaise yourself.)"
Can i calculate or estimate Hinv and J from the bootstrapped estimates? If not what is the best way to proceed?
Any help on this much appreciated. This my first timing posting a question here, though I've greatly benefited from the answers to other peoples questions in the past.
For question 2: You can use R = for resampling. For example:
anova(object, ..., test = "Wald", joint = TRUE, score =
"tau", se = "nid", R = 10000, trim = NULL)
Where R is the number of resampling replications for the anowar form of the test, used to estimate the reference distribution for the test statistic.
Just a heads up, you'll probably get a better response to your questions if you only include 1 question per post.
Consulted with a colleague, and he confirmed that it was unlikely that Hinv and J could be 'reverse' computed from bootstrapped estimates. However we resolved that estimates from different taus could be compared using Wald test as follows.
From object rqs produced by
q.Summary <- summary(Q.mod, se="boot", R=10000, bsmethod="mcmb", covariance=TRUE)
you extract the bootstrapped Beta values for variable of interest in this case VAR, the first covariate in FML for each tau
boot.Bs <- sapply(q.Summary, function (x) x[["B"]][,2])
B0 <- coef(summary(lm(FML, data)))[2,1] # Extract liner estimate data linear estimate
Then compute wald statistic and get pvalue with number of quantiles for degrees of freedom
Wald <- sum(apply(boot.Bs, 2, function (x) ((mean(x)-B0)^2)/var(x)))
Pvalue <- pchisq(Wald, ncol(boot.Bs), lower=FALSE)
You also want to verify that bootstrapped Betas are normally distributed, and if you're running many taus it can be cumbersome to check all those QQ plots so just sum them by row
qqnorm(apply(boot.Bs, 1, sum))
qqline(apply(boot.Bs, 1, sum), col = 2)
This seems to be working, and if anyone can think of anything wrong with my solution, please share

Finding critical values for the Pearson correlation coefficient

I'd like to use R to find the critical values for the Pearson correlation coefficient.
This has proved difficult to find in search engines since the standard variable for the Pearson correlation coefficient is itself r. In turn, I'm finding a lot of r critical value tables (rather than how to find this by using the statistical package R).
I'm looking for a function that will provide output like the following:
I'm comfortable finding the correlation with:
cor(x,y)
However, I'd also like to find the critical values.
Is there a function I can use to enter n (or degrees of freedom) as well as alpha in order to find the critical value?
The significance of a correlation coefficient, r, is determined by converting r to a t-statistic and then finding the significance of that t-value at the degrees of freedom that correspond to the sample size, n. So, you can use R to find the critical t-value and then convert that value back to a correlation coefficient to find the critical correlation coefficient.
critical.r <- function( n, alpha = .05 ) {
df <- n - 2
critical.t <- qt(alpha/2, df, lower.tail = F)
critical.r <- sqrt( (critical.t^2) / ( (critical.t^2) + df ) )
return(critical.r)
}
# Example usage: Critical correlation coefficient at sample size of n = 100
critical.r( 100 )
The general structure of hypothesis testing is kind of a mish-mash of two systems: Fisherian and Neyman-Pearson. Statisticians understand the differences but rarely does this get clearly presented in undergraduate stats classes. R was designed by and intended for statisticians as a toolbox, so they constructed a function named cor.test that will deliver a p-value (part of the Fisherian tradition) as well as a confidence interval for "r" (derived on the basis of the Neyman-Pearson formalism.) Fisher and Neyman had bitter disputes in their lifetime. The "critical value" terminology is part of the N-P testing strategy. It is equivalent to building a confidence interval and finding the particular statistic that reaches exactly a threshold value of 0.05 significance.
The code for constructing the inferential statistics in cor.test is available with:
methods(cor.test)
getAnywhere(cor.test.default)
# scroll down
method <- "Pearson's product-moment correlation"
#-----partial code----
r <- cor(x, y)
df <- n - 2L
ESTIMATE <- c(cor = r)
PARAMETER <- c(df = df)
STATISTIC <- c(t = sqrt(df) * r/sqrt(1 - r^2))
p <- pt(STATISTIC, df)
# ---- omitted some set up and error checking ----
# this is the confidence interval section------
z <- atanh(r)
sigma <- 1/sqrt(n - 3)
cint <- switch(alternative, less = c(-Inf, z + sigma *
qnorm(conf.level)), greater = c(z - sigma * qnorm(conf.level),
Inf), two.sided = z + c(-1, 1) * sigma * qnorm((1 +
conf.level)/2))
cint <- tanh(cint)
So now you know how R does it. Notice that there is no "critical value" mentioned. I suspect that your hope was to find some table where a tabulation of "r" and "df" was laid out displaying the minimum "r" that would reach a significance of 0.05 for a given 'df'. Such a table could be built but that's not how this particular toolbox is constructed. You should now have the tools to build it yourself.
I would do the same. But if you are using a Spearman correlation you need to convert t into r using a different formula.
just change the last line before the return in the function with this one:
critical.r <- sqrt(((critical.t^2) / (df)) + 1)

Chi square goodness of fit for random numbers generated

I have used Inverse CDF method to generate 1000 samples from an exponential and a Cauchy random variable.
Now to verify whether these belong to their relevant distributions, I have to perform Chi-Squared Test for Goodness of fit.
I have tried two approaches (as below) -
Chisq.test(y) #which has 1000 samples from supposed exponential distribution
chisq.test(z) #cauchy
I am getting the following error:
data: y
X-squared = 234.0518, df = 999, p-value = 1
Warning message:
In chisq.test(y) : Chi-squared approximation may be incorrect
chisq.test(z)
Error in chisq.test(z) :
all entries of 'x' must be nonnegative and finite
I downloaded the vcd library to use goodfit()
and typed:
t1 <- goodfit(y,type= "exponential",method= "MinChiSq")
summary(t1)
In this case, the error message:
Error: could not find function "goodfit"
can somebody please guide on how to implement the Chi-Squared GOF test properly?
Note: The samples are not from normal distribution (exponential and cauchy respectively)
I am trying to understand if it is possible to get the observed and expected data instead with no luck so far.
edit - I did type in library(vcd) before writing the rest of the code. Apologies to have assumed it was obvious.
The chisq.test(...) function is designed primarily for use with counts, so it expects its arguments to be either countable (using table(...) for example), or to be counts already. It basically creates a contingency table for x and y (the first two arguments) and then uses the chisq test to determine if they are from the same distribution.
You are probably better off using the Kolmogorov–Smirnov test, which is designed for problems like yours. The K-S test compares the ecdf of the sample to the cdf of the test distribution and tests the null hypothesis that they are the same.
set.seed(1)
df <- data.frame(y = rexp(1000),
z = rcauchy(1000, 100, 100))
ks.test(df$y,"pexp")
# One-sample Kolmogorov-Smirnov test
#
# data: df$y
# D = 0.0387, p-value = 0.1001
# alternative hypothesis: two-sided
ks.test(df$z,"pcauchy",100,100)
# One-sample Kolmogorov-Smirnov test
#
# data: df$z
# D = 0.0296, p-value = 0.3455
# alternative hypothesis: two-sided
Note that in this case, the K-S test predicts a 90% chance that your sample df$y did not come from an exponential distribution, even though it clearly did.
You can use chisq.test(...) by artificially binning your data and then comparing the counts in each bin to what would be expected from your test distribution (using p=...), but this is convoluted and the answer you get depends on the number of bins.
breaks <- c(seq(0,10,by=1))
O <- table(cut(df$y,breaks=breaks))
p <- diff(pexp(breaks))
chisq.test(O,p=p, rescale.p=T)
# Chi-squared test for given probabilities
#
# data: O
# X-squared = 7.9911, df = 9, p-value = 0.535
In this case the chisq test predicts a 47% chance that your sample did not come from an exponential distribution.
Finally, even though they are qualitative, I find Q-Q plots to be very useful. These plot quantiles of your sample against quantiles of the test distribution. If the sample is drawn from the test distribution, the Q-Q plot should fall close to the line y=x.
par(mfrow=c(1,2))
plot(qexp(seq(0,1,0.01)),quantile(df$y,seq(0,1,0.01)),
main="Q-Q Plot",ylab="df$Y", xlab="Exponential",
xlim=c(0,5),ylim=c(0,5))
plot(qcauchy(seq(0,.99,0.01),100,100),quantile(df$z,seq(0,.99,0.01)),
main="Q-Q Plot",ylab="df$Z",xlab="Cauchy",
xlim=c(-1000,1000),ylim=c(-1000,1000))
Looking at the Q-Q plots gives me much more confidence in asserting that df$y and df$z are drawn, respectively, from the Exponential and Cauchy distributions than either the K-S or ChiSq tests, even though I can't put a number on it.
# Simulation
set.seed(123)
df <- data.frame(y = rexp(1000),
z = rcauchy(1000, 100, 100)
)
#This seems to be different, probably because of how you are simulating the data
chisq.test(df$y)
# Chi-squared test for given probabilities
#
# data: df$y
# X-squared = 978.485, df = 999, p-value = 0.6726
#
# Warning message:
# In chisq.test(df$y) : Chi-squared approximation may be incorrect
3 details:
1) you need to load the package. library(vcd)
2) There is no "exponential" type of distribution in the goodfit function
3) the method is MinChisq, Not MinChiSq
.
library(vcd)
t1 <- goodfit(df$y, type= "binomial", method= "MinChisq")
summary(t1)
# Goodness-of-fit test for binomial distribution
#
# X^2 df P(> X^2)
# Pearson 31.00952 6 2.524337e-05
# Warning message:
# In summary.goodfit(t1) : Chi-squared approximation may be incorrect

Resources