How does wilcox.test() handle ties? - r

When I run pairwise.wilcox.test() on data with many ties, I get the following warning:
Warning in wilcox.test.default(xi, xj, paired = paired, ...) :
cannot compute exact p-value with ties
I would like to know how wilcox.test() handles ties.
What method is used (by default) to rank the observations?
What does "P value adjustment method: holm" mean?

When there are ties, wilcox.test uses a Normal approximation. You can see this in the source code of stats:::wilcox.test.default; here is a slightly simplified version.
## example values
x <- 1:5
y <- 2:6
## assumes mu=0
r <- c(x, y)
## slightly simplified (assumes `digits.rank` is equal to its default `Inf` value)
r <- rank(r)
NTIES <- table(r)
n.x <- length(x)
n.y <- length(y)
STATISTIC <- c("W" = sum(r[seq_along(x)]) - n.x * (n.x + 1) / 2)
z <- STATISTIC - n.x * n.y / 2
SIGMA <- sqrt((n.x * n.y / 12) *
((n.x + n.y + 1)
- sum(NTIES^3 - NTIES) ## this will be zero in the absence of ties
/ ((n.x + n.y) * (n.x + n.y - 1))))
## stuff about continuity correction omitted here
z <- z/SIGMA ## z-score, used to compute p-value
2 * pnorm(-abs(z)) ## 2-tailed p-value; -abs(z) picks the correct tail whichever side z falls on
This gives the same p-value as wilcox.test(x, y, correct = FALSE).
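On the ranking question: the code above calls rank() with its defaults, and the default ties.method of rank() is "average", so tied observations share the mean of the ranks they span. The following is a minimal sketch that repeats the calculation on made-up data with ties (the vectors x and y here are hypothetical) and compares it with wilcox.test. It passes exact = FALSE so that the normal approximation is requested explicitly.

```r
## hypothetical example data with several tied values
x <- c(1, 2, 2, 3, 5, 5)
y <- c(2, 3, 3, 5, 6, 7, 7)

## ranks of the pooled sample; ties get the average of the ranks they span
r <- rank(c(x, y))            # ties.method = "average" is the default
n.x <- length(x)
n.y <- length(y)
n.ties <- table(r)            # size of each group of tied ranks

## W statistic and its tie-corrected standard deviation
W <- sum(r[seq_along(x)]) - n.x * (n.x + 1) / 2
sigma <- sqrt((n.x * n.y / 12) *
              ((n.x + n.y + 1) -
               sum(n.ties^3 - n.ties) / ((n.x + n.y) * (n.x + n.y - 1))))

## z-score without continuity correction, two-sided p-value
z <- (W - n.x * n.y / 2) / sigma
p_manual <- 2 * pnorm(-abs(z))

## same test through wilcox.test, normal approximation requested explicitly
p_r <- wilcox.test(x, y, exact = FALSE, correct = FALSE)$p.value
all.equal(p_manual, p_r)
```

Dropping correct = FALSE switches the continuity correction back on, which shifts z by 0.5 towards zero before dividing by sigma.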
As for "P value adjustment method: holm": pairwise.wilcox.test passes the p-values from all pairwise comparisons to p.adjust, and the help page ?p.adjust says that "holm" is the method from Holm (1979), which controls the family-wise error rate.
Holm, S. (1979). A simple sequentially rejective multiple test
procedure. Scandinavian Journal of Statistics, 6, 65-70.
https://www.jstor.org/stable/4615733.
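To make the Holm adjustment concrete, here is a small sketch. The p-values and the helper name holm_manual are made up for illustration. The idea is to sort the m p-values, multiply the i-th smallest by (m - i + 1), force the results to be non-decreasing, and cap them at 1.

```r
## hypothetical unadjusted p-values from four pairwise comparisons
p <- c(0.010, 0.040, 0.030, 0.200)

## hypothetical helper: Holm step-down adjustment written out by hand
holm_manual <- function(p) {
  m <- length(p)
  o <- order(p)                                   # smallest p-value first
  adj <- pmin(1, cummax((m - seq_len(m) + 1) * p[o]))
  adj[order(o)]                                   # back to the original order
}

holm_manual(p)
p.adjust(p, method = "holm")
```

Both calls should print the same vector; p.adjust is what pairwise.wilcox.test uses internally.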

Related

Better optimizer for constrained multinomial likelihood

Using R, I wish to estimate a vector of parameters a_i (of arbitrary length, i.e. i = 1,...,s) with a multinomial likelihood using a corresponding vector of observations n_i totaling a sample size of N=sum_i (n_i). The probabilities p_i of the multinomial are determined by said a parameters and measurements of variable x such that p_i = (a_i * x_i)/sum_i (a_i * x_i). I wish further to impose the constraint that sum_i a_i = 1.
I've managed to get optim() to do the job as follows --- implementing the two tricks I've seen of estimating the first a_1 as 1 - sum_{i=2} a_i and additionally renormalizing all estimates to 1 --- but the accuracy and dependability of achieving convergence remains rather variable (in addition to being sensitive to the vector of starting estimates I provide), even when N is very large.
I would appreciate guidance on more robust alternatives and/or improvements.
s <- 10 # vector length
N <- 1000 # total sample size
# variable
x_i <- round(rlnorm(s, 2.5, 1.5)) # one measurement per parameter (`s`; `n_p` was undefined)
# true parameter values
a_i <- rbeta(s, 2, 2)
a_i <- a_i / sum(a_i)
# generate observations
n_i <- rmultinom(1, N, (a_i * x_i) / sum(a_i * x_i))
# negative log-likelihood for parameters `par'
nll = function(par) {
if (any(0 > par | par > 1)) {
return(NA)
}else{
par <- c(1 - sum(par), par) # estimate first as remainder
par <- par / sum(par) # normalize
p_i <- (par * x_i) / sum(par * x_i) # model for probabilities
- sum(dmultinom(
x = n_i,
size = N,
prob = p_i,
log = TRUE
)) }
}
# starting values (dropping first)
start = rep(1/s, s-1)
fit <- optim(par = start,
fn = nll,
control = list(maxit = 10000)
)
ests = c(1 - sum(fit$par), fit$par)
cbind(a_i, ests)
par(pty = 's')
plot(a_i, ests)
abline(0, 1)
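One possible alternative, offered only as a sketch under stated assumptions and not as a tested recommendation: replace the box check and renormalisation with a softmax parameterisation, so that every value the optimizer tries maps to a valid a with sum 1, and use a gradient-based method. The data below are fixed, made-up values (all x_i > 0), and the names nll_softmax, a_hat and a_closed are hypothetical. Because this model has as many free parameters as the multinomial itself, the maximum should also be available in closed form as a_i proportional to n_i / x_i, which gives a way to check the optimizer.

```r
set.seed(1)

## hypothetical fixed inputs, all x_i strictly positive
x_i    <- c(3, 12, 7, 25, 1)
a_true <- c(0.10, 0.30, 0.20, 0.25, 0.15)
s <- length(x_i)
N <- 10000
n_i <- as.vector(rmultinom(1, N, a_true * x_i / sum(a_true * x_i)))

## negative log-likelihood in unconstrained parameters theta (length s - 1)
nll_softmax <- function(theta) {
  a <- exp(c(0, theta))          # first component fixed as the reference
  a <- a / sum(a)                # softmax: a > 0 and sum(a) == 1 by construction
  p <- a * x_i / sum(a * x_i)    # model for the multinomial probabilities
  -dmultinom(n_i, prob = p, log = TRUE)
}

fit <- optim(rep(0, s - 1), nll_softmax, method = "BFGS")

## map the estimates back to the simplex
a_hat <- exp(c(0, fit$par))
a_hat <- a_hat / sum(a_hat)

## closed-form check, valid only when every x_i > 0
a_closed <- (n_i / x_i) / sum(n_i / x_i)

cbind(a_true, a_hat, a_closed)
```

If a_hat and a_closed agree, the remaining gap to a_true is sampling noise rather than an optimizer problem.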

What kind of formula is used to calculate the p-value in `t.test`?

So, just a touch of backstory. I've been learning biostatistics in the past 4-5 months in university, 6 months of biomathematics before that. I only started deep diving into programming around 5 days ago.
I've been trying to redo t.test() with my own function.
test2 = function(t,u){
T = (mean(t) - u) / ( sd(t) / sqrt(length(t)))
t1=round(T, digits=5)
df=length(t)
cat(paste('t - value =', t1,
'\n','df =', df-1,
'\n','Alternative hipotézis: a minta átlag nem egyenlő a hipotetikus átlaggal'))
}
I tried searching the formula for the p-value, I found one, but when I used it, my value was different from the one within the t.test.
The t-value and the df do match t.test().
I highly appreciate any help, thank you.
P.s: Don't worry about the last line, it's in Hungarian.
The p-value can be derived from the distribution function of the t distribution, pt. Using this, and switching to the more common notation of a sample x and a hypothesised population mean (mu, passed as the argument u), we can use something like:
test2 <- function(x, u){
t <- (mean(x) - u) / (sd(x) / sqrt(length(x)))
df <- length(x) - 1
cat('t-value =', t, ', df =', df, ', p =', 2 * pt(-abs(t), df = df), '\n') # -abs(t) keeps the two-sided p-value correct when t is negative
}
set.seed(123) # remove this for other random values
## random sample
x <- rnorm(10, mean=5.5)
## population mean
mu <- 5
## own function
test2(x, mu)
## one sample t-test from R
t.test(x, mu=mu)
We get for the own test2:
t-value = 1.905175 , df = 9, p = 0.08914715
and for R's t.test
One Sample t-test
data: x
t = 1.9052, df = 9, p-value = 0.08915
alternative hypothesis: true mean is not equal to 5
95 percent confidence interval:
4.892330 6.256922
sample estimates:
mean of x
5.574626
The definitive source of what R is doing is the source code. If you look at the source code for stats:::t.test.default (which you can get by typing stats:::t.test.default into the console, without parentheses at the end and hitting enter), you'll see that for a single-sample test like the one you're trying to do above, you would get the following:
nx <- length(x)
mx <- mean(x)
vx <- var(x)
df <- nx - 1
stderr <- sqrt(vx/nx)
tstat <- (mx - mu)/stderr
if (alternative == "less") {
pval <- pt(tstat, df)
}
else if (alternative == "greater") {
pval <- pt(tstat, df, lower.tail = FALSE)
}
else {
pval <- 2 * pt(-abs(tstat), df)
}
These are the relevant pieces (there's a lot more code in there, too).
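As a sketch, the excerpt above can be wrapped in a small function and checked against t.test for all three alternatives. The function name t_pvalue is hypothetical and it covers the one-sample case only.

```r
## hypothetical helper mirroring the excerpt above (one-sample case only)
t_pvalue <- function(x, mu = 0,
                     alternative = c("two.sided", "less", "greater")) {
  alternative <- match.arg(alternative)
  df <- length(x) - 1
  tstat <- (mean(x) - mu) / sqrt(var(x) / length(x))
  switch(alternative,
         less      = pt(tstat, df),
         greater   = pt(tstat, df, lower.tail = FALSE),
         two.sided = 2 * pt(-abs(tstat), df))
}

set.seed(123)
x <- rnorm(10, mean = 5.5)

## compare with t.test for each alternative
for (alt in c("two.sided", "less", "greater")) {
  print(all.equal(t_pvalue(x, mu = 5, alternative = alt),
                  t.test(x, mu = 5, alternative = alt)$p.value))
}
```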

What is the intercept in coef of smooth.basis with a Fourier basis?

Suppose I have data like y and I fit a smooth function to it with a Fourier basis:
library(fda)
y <- c(1,2,5,8,9,2,5)
x <- seq_along(y)
Fo <- create.fourier.basis(c(0, 7), 4)
precfd <- smooth.basis(x, y, Fo)
plotfit.fd(y, x, precfd$fd)
coef(precfd)
The output of the last line gives me this:
const 411.1060285
sin1 -30.5584033
cos1 6.5740933
sin2 26.2855849
cos2 -26.0153965
I know what the coefficients are, but what is const? The original formula has no constant part, as this link says:
http://lampx.tugraz.at/~hadley/num/ch3/3.3a.php
The first basis function in create.fourier.basis is a constant function to allow for a non-zero mean (intercept) in the data. From the documentation of the create.fourier.basis function:
The first basis function is the unit function with the value one everywhere. The next two are the sine/cosine pair with period defined in the argument period. The fourth and fifth are the sin/cosine series with period one half of period. And so forth. The number of basis functions is usually odd.
You can drop the first (unit) basis function in create.fourier.basis with the argument dropind = 1. Below is some example code that illustrates which basis functions are used in create.fourier.basis. Note: the scaling of the basis functions depends on the period argument in create.fourier.basis.
Example 1: non-zero mean
library(fda)
## time sequence
tt <- seq(from = 0, to = 1, length = 100)
## basis functions
phi_0 <- 1
phi_1 <- function(t) sin(2 * pi * t) / sqrt(1 / 2)
phi_2 <- function(t) cos(2 * pi * t) / sqrt(1 / 2)
## signal
f1 <- 10 * phi_0 + 5 * phi_1(tt) - 5 * phi_2(tt)
## noise
eps <- rnorm(100)
## data
X1 <- f1 + eps
## create Fourier basis with intercept
four.basis1 <- create.fourier.basis(rangeval = range(tt), nbasis = 3)
## evaluate values basis functions
## eval.basis(tt, four.basis1)
## fit Fourier basis to data
four.fit1 <- smooth.basis(tt, X1, four.basis1)
coef(four.fit1)
Example 2: zero mean
## signal
f2 <- 5 * phi_1(tt) - 5 * phi_2(tt)
## data
X2 <- f2 + eps
## create Fourier basis without intercept
four.basis2 <- create.fourier.basis(rangeval = range(tt), nbasis = 3, dropind = 1)
## evaluate values basis functions
## eval.basis(tt, four.basis2)
## fit Fourier basis to data
four.fit2 <- smooth.basis(tt, X2, four.basis2)
coef(four.fit2)
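The role of the constant column can also be seen without the fda package. The sketch below is plain least squares in base R on a hand-built design matrix with the same three functions as in Example 1. It assumes that an unpenalised basis fit behaves like ordinary least squares and is not a reproduction of what smooth.basis does internally. The names B, coef_with and coef_without are made up for illustration.

```r
set.seed(1)
tt <- seq(from = 0, to = 1, length.out = 100)

## design matrix: constant, then one sine/cosine pair with period 1
B <- cbind(const = 1,
           sin1  = sin(2 * pi * tt) / sqrt(1 / 2),
           cos1  = cos(2 * pi * tt) / sqrt(1 / 2))

## signal with mean 10, plus noise
X1 <- drop(B %*% c(10, 5, -5)) + rnorm(100)

## least-squares coefficients with and without the constant column
coef_with    <- qr.solve(B, X1)
coef_without <- qr.solve(B[, -1], X1)
coef_with
coef_without

## mean of what is left unexplained when the constant is dropped
leftover_mean <- mean(X1 - B[, -1] %*% coef_without)
leftover_mean
```

With the constant column the first coefficient absorbs the mean of the data. Without it, the sine and cosine terms cannot represent that mean, so it stays in the residuals.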

How does ar.yw estimate the variance

In R, how does the function ar.yw estimate the variance? Specifically, where does the number "var.pred" come from? It does not seem to come from the usual YW estimate of the variance, nor the sum of squared residuals divided by df (even though there is disagreement about what the df should be, none of the choices give an answer equivalent to var.pred). And yes, I know that there are better methods than YW; just trying to figure out what R is doing.
set.seed(82346)
temp <- arima.sim(n=10, list(ar = 0.5), sd=1)
fit <- ar(temp, method = "yule-walker", demean = FALSE, aic=FALSE, order.max=1)
## R's estimate of the sigma squared
fit$var.pred
## YW estimate
sum(temp^2)/10 - fit$ar*sum(temp[2:10]*temp[1:9])/10
## YW if there was a mean
sum((temp-mean(temp))^2)/10 - fit$ar*sum((temp[2:10]-mean(temp))*(temp[1:9]-mean(temp)))/10
## estimate based on residuals, different possible df.
sum(na.omit(fit$resid^2))/10
sum(na.omit(fit$resid^2))/9
sum(na.omit(fit$resid^2))/8
sum(na.omit(fit$resid^2))/7
You need to read the code if it's not documented. Start with the help page:
?ar.yw
Which says: "In ar.yw the variance matrix of the innovations is computed from the fitted coefficients and the autocovariance of x." If that is not enough explanation, then you need to look at the code:
methods(ar.yw)
#[1] ar.yw.default* ar.yw.mts*
#see '?methods' for accessing help and source code
getAnywhere(ar.yw.default)
# there are two cases that I see
x <- as.matrix(x)
nser <- ncol(x)
if (nser > 1L) # .... not your situation
#....
else{
r <- as.double(drop(xacf))
z <- .Fortran(C_eureka, as.integer(order.max), r, r,
coefs = double(order.max^2), vars = double(order.max),
double(order.max))
coefs <- matrix(z$coefs, order.max, order.max)
partialacf <- array(diag(coefs), dim = c(order.max, 1L,
1L))
var.pred <- c(r[1L], z$vars)
#.......
order <- if (aic)
(0L:order.max)[xaic == 0L]
else order.max
ar <- if (order)
coefs[order, seq_len(order)]
else numeric()
var.pred <- var.pred[order + 1L]
var.pred <- var.pred * n.used/(n.used - (order + 1L))
So you now need to find the Fortran code for C_eureka. I think I'm finding it here: https://svn.r-project.org/R/trunk/src/library/stats/src/eureka.f This is the code that I think is returning the var.pred estimate. I'm not a time series guy and it's your responsibility to review this process for applicability to your problem.
subroutine eureka (lr,r,g,f,var,a)
c
c solves Toeplitz matrix equation toep(r)f=g(1+.)
c by Levinson's algorithm
c a is a workspace of size lr, the number
c of equations
c
snipped
c estimate the innovations variance
var(l) = var(l-1) * (1 - f(l,l)*f(l,l))
if (l .eq. lr) return
d = 0.0d0
q = 0.0d0
do 50 i = 1, l
k = l-i+2
d = d + a(i)*r(k)
q = q + f(l,i)*r(k)
50 continue
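Putting the two excerpts together for the AR(1) example in the question gives the sketch below. It assumes that the autocovariances are computed without demeaning and with divisor n (as acf does when demean = FALSE), that the first Levinson step gives var(1) = r0 * (1 - phi^2), and that the R code then rescales by n.used / (n.used - (order + 1)). The names r0, r1, phi and var_manual are introduced here for illustration.

```r
set.seed(82346)
temp <- arima.sim(n = 10, list(ar = 0.5), sd = 1)
fit <- ar(temp, method = "yule-walker", demean = FALSE, aic = FALSE, order.max = 1)

n  <- length(temp)
r0 <- sum(temp^2) / n                  # lag-0 autocovariance, no demeaning
r1 <- sum(temp[-1] * temp[-n]) / n     # lag-1 autocovariance
phi <- r1 / r0                         # Yule-Walker AR(1) coefficient

## innovations variance from the first Levinson step,
## then the n / (n - (order + 1)) rescaling with order = 1
var_manual <- r0 * (1 - phi^2) * n / (n - 2)

c(manual = var_manual, ar.yw = as.numeric(fit$var.pred))
```

Since r0 * (1 - phi^2) equals r0 - phi * r1, this is the "YW estimate" from the question multiplied by 10/8, which would explain why none of the candidates matched exactly.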

How does `poly()` generate orthogonal polynomials? How to understand the "coefs" returned?

My understanding of orthogonal polynomials is that they take the form
y(x) = a1 + a2(x - c1) + a3(x - c2)(x - c3) + a4(x - c4)(x - c5)(x - c6)... up to the number of terms desired
where a1, a2 etc are coefficients to each orthogonal term (vary between fits), and c1, c2 etc are coefficients within the orthogonal terms, determined such that the terms maintain orthogonality (consistent between fits using the same x values)
I understand poly() is used to fit orthogonal polynomials. An example
x = c(1.160, 1.143, 1.126, 1.109, 1.079, 1.053, 1.040, 1.027, 1.015, 1.004, 0.994, 0.985, 0.977) # abscissae not equally spaced
y = c(1.217395, 1.604360, 2.834947, 4.585687, 8.770932, 9.996260, 9.264800, 9.155079, 7.949278, 7.317690, 6.377519, 6.409620, 6.643426)
# construct the orthogonal polynomial
orth_poly <- poly(x, degree = 5)
# fit y to orthogonal polynomial
model <- lm(y ~ orth_poly)
I would like to extract both the coefficients a1, a2 etc, as well as the orthogonal coefficients c1, c2 etc. I'm not sure how to do this. My guess is that
model$coefficients
returns the first set of coefficients, but I'm struggling with how to extract the others. Perhaps within
attributes(orth_poly)$coefs
?
Many thanks.
I have just realized that there was a closely related question, "Extracting orthogonal polynomial coefficients from R's poly() function?", 2 years ago. The answer there merely explains what predict.poly does, but my answer gives a complete picture.
Section 1: How does poly represent orthogonal polynomials
My understanding of orthogonal polynomials is that they take the form
y(x) = a1 + a2(x - c1) + a3(x - c2)(x - c3) + a4(x - c4)(x - c5)(x - c6)... up to the number of terms desired
No, there is no such clean form. poly() generates monic orthogonal polynomials, which can be represented by a three-term recursion. This is how predict.poly generates the linear predictor matrix. Surprisingly, poly itself does not use this recursion but brute force: a QR factorization of the model matrix of ordinary polynomials, to obtain an orthogonal span. However, this is equivalent to the recursion.
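The recursion itself is not reproduced in the text. Read off from the my_poly code in Section 3, and using the alpha, beta and l notation of Section 2, it can be written as follows. This is a reconstruction from that code, not a quotation. Here z denotes the centred covariate and the sums run over the data points.

```latex
\begin{aligned}
z &= x - \bar{x}, \qquad P_0(z) = 1, \qquad P_1(z) = z, \\
P_{i+1}(z) &= (z - \alpha_i)\,P_i(z) - \beta_i\,P_{i-1}(z), \qquad i \ge 1, \\
l_i &= \sum_j P_i(z_j)^2, \qquad
\alpha_i = \frac{\sum_j z_j\,P_i(z_j)^2}{l_i}, \qquad
\beta_i = \frac{l_i}{l_{i-1}} .
\end{aligned}
```

The columns of the matrix returned by poly are then P_1, P_2, ... evaluated at the data and divided by sqrt(l_i), so that each column has unit norm.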
Section 2: Explanation of the output of poly()
Let's consider an example. Take the x in your post,
X <- poly(x, degree = 5)
# 1 2 3 4 5
# [1,] 0.484259711 0.48436462 0.48074040 0.351250507 0.25411350
# [2,] 0.406027697 0.20038942 -0.06236564 -0.303377083 -0.46801416
# [3,] 0.327795682 -0.02660187 -0.34049024 -0.338222850 -0.11788140
# ... ... ... ... ... ...
#[12,] -0.321069852 0.28705108 -0.15397819 -0.006975615 0.16978124
#[13,] -0.357884918 0.42236400 -0.40180712 0.398738364 -0.34115435
#attr(,"coefs")
#attr(,"coefs")$alpha
#[1] 1.054769 1.078794 1.063917 1.075700 1.063079
#
#attr(,"coefs")$norm2
#[1] 1.000000e+00 1.300000e+01 4.722031e-02 1.028848e-04 2.550358e-07
#[6] 5.567156e-10 1.156628e-12
Here is what those attributes are:
alpha[1] gives the x_bar = mean(x), i.e., the centre;
alpha - alpha[1] gives alpha0, alpha1, ..., alpha4 (alpha5 is computed but dropped before poly returns X, as it won't be used in predict.poly);
The first value of norm2 is always 1. The second to the last are l0, l1, ..., l5, giving the squared column norm of X; l0 is the column squared norm of the dropped P0(x - x_bar), which is always n (i.e., length(x)); while the first 1 is just padded in order for the recursion to proceed inside predict.poly.
beta0, beta1, beta2, ..., beta_5 are not returned, but can be computed by norm2[-1] / norm2[-length(norm2)].
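These statements about the attributes can be checked directly. A minimal sketch using the x from the question (the variable names coefs and beta are chosen here for illustration):

```r
x <- c(1.160, 1.143, 1.126, 1.109, 1.079, 1.053, 1.040, 1.027,
       1.015, 1.004, 0.994, 0.985, 0.977)
X <- poly(x, degree = 5)
coefs <- attr(X, "coefs")

## alpha[1] is the centre mean(x)
all.equal(coefs$alpha[1], mean(x))

## the second entry of norm2 (l0) is the number of observations
all.equal(coefs$norm2[2], as.numeric(length(x)))

## columns of X are orthonormal, so crossprod(X) is the identity
all.equal(crossprod(X), diag(5), check.attributes = FALSE)

## alpha0, ..., alpha4 and beta0, ..., beta5 as described above
coefs$alpha - coefs$alpha[1]
beta <- coefs$norm2[-1] / coefs$norm2[-length(coefs$norm2)]
beta
```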
Section 3: Implementing poly using both QR factorization and recursion algorithm
As mentioned earlier, poly does not use recursion, while predict.poly does. Personally I don't understand the logic / reason behind such an inconsistent design. Here I offer a function my_poly, written by myself, that uses recursion to generate the matrix if QR = FALSE. When QR = TRUE, it is a similar but not identical implementation to poly. The code is well commented, which should help you understand both methods.
## return a model matrix for data `x`
my_poly <- function (x, degree = 1, QR = TRUE) {
## check feasibility
if (length(unique(x)) <= degree) ## need at least degree + 1 unique points, as in poly()
stop("insufficient unique data points for specified degree!")
## centring covariates (so that `x` is orthogonal to intercept)
centre <- mean(x)
x <- x - centre
if (QR) {
## QR factorization of design matrix of ordinary polynomial
QR <- qr(outer(x, 0:degree, "^"))
## X <- qr.Q(QR) * rep(diag(QR$qr), each = length(x))
## i.e., column rescaling of Q factor by `diag(R)`
## also drop the intercept
X <- qr.qy(QR, diag(diag(QR$qr), length(x), degree + 1))[, -1, drop = FALSE]
## now columns of `X` are orthogonal to each other
## i.e., `crossprod(X)` is diagonal
X2 <- X * X
norm2 <- colSums(X * X) ## squared L2 norm
alpha <- drop(crossprod(X2, x)) / norm2
beta <- norm2 / (c(length(x), norm2[-degree]))
colnames(X) <- 1:degree
}
else {
beta <- alpha <- norm2 <- numeric(degree)
## repeat first polynomial `x` on all columns to initialize design matrix X
X <- matrix(x, nrow = length(x), ncol = degree, dimnames = list(NULL, 1:degree))
## compute alpha[1] and beta[1]
norm2[1] <- new_norm <- drop(crossprod(x))
alpha[1] <- sum(x ^ 3) / new_norm
beta[1] <- new_norm / length(x)
if (degree > 1L) {
old_norm <- new_norm
## second polynomial
X[, 2] <- Xi <- (x - alpha[1]) * X[, 1] - beta[1]
norm2[2] <- new_norm <- drop(crossprod(Xi))
alpha[2] <- drop(crossprod(Xi * Xi, x)) / new_norm
beta[2] <- new_norm / old_norm
old_norm <- new_norm
## further polynomials obtained from recursion
i <- 3
while (i <= degree) {
X[, i] <- Xi <- (x - alpha[i - 1]) * X[, i - 1] - beta[i - 1] * X[, i - 2]
norm2[i] <- new_norm <- drop(crossprod(Xi))
alpha[i] <- drop(crossprod(Xi * Xi, x)) / new_norm
beta[i] <- new_norm / old_norm
old_norm <- new_norm
i <- i + 1
}
}
}
## column rescaling so that `crossprod(X)` is an identity matrix
scale <- sqrt(norm2)
X <- X * rep(1 / scale, each = length(x))
## add attributes and return
attr(X, "coefs") <- list(centre = centre, scale = scale, alpha = alpha[-degree], beta = beta[-degree])
X
}
Section 4: Explanation of the output of my_poly
X <- my_poly(x, 5, FALSE)
The resulting matrix is the same as the one generated by poly, so it is left out. The attributes are different.
#attr(,"coefs")
#attr(,"coefs")$centre
#[1] 1.054769
#attr(,"coefs")$scale
#[1] 2.173023e-01 1.014321e-02 5.050106e-04 2.359482e-05 1.075466e-06
#attr(,"coefs")$alpha
#[1] 0.024025005 0.009147498 0.020930616 0.008309835
#attr(,"coefs")$beta
#[1] 0.003632331 0.002178825 0.002478848 0.002182892
my_poly returns the construction information more transparently:
centre gives x_bar = mean(x);
scale gives column norms (the square root of norm2 returned by poly);
alpha gives alpha1, alpha2, alpha3, alpha4;
beta gives beta1, beta2, beta3, beta4.
Section 5: Prediction routine for my_poly
Since my_poly returns different attributes, stats:::predict.poly is not compatible with my_poly. Here is the appropriate routine my_predict_poly:
## return a linear predictor matrix, given a model matrix `X` and new data `x`
my_predict_poly <- function (X, x) {
## extract construction info
coefs <- attr(X, "coefs")
centre <- coefs$centre
alpha <- coefs$alpha
beta <- coefs$beta
degree <- ncol(X)
## centring `x`
x <- x - coefs$centre
## repeat first polynomial `x` on all columns to initialize design matrix X
X <- matrix(x, length(x), degree, dimnames = list(NULL, 1:degree))
if (degree > 1L) {
## second polynomial
X[, 2] <- (x - alpha[1]) * X[, 1] - beta[1]
## further polynomials obtained from recursion
i <- 3
while (i <= degree) {
X[, i] <- (x - alpha[i - 1]) * X[, i - 1] - beta[i - 1] * X[, i - 2]
i <- i + 1
}
}
## column rescaling so that `crossprod(X)` is an identity matrix
X * rep(1 / coefs$scale, each = length(x))
}
Consider an example:
set.seed(0); x1 <- runif(5, min(x), max(x))
and
stats:::predict.poly(poly(x, 5), x1)
my_predict_poly(my_poly(x, 5, FALSE), x1)
give exactly the same predictor matrix:
# 1 2 3 4 5
#[1,] 0.39726381 0.1721267 -0.10562568 -0.3312680 -0.4587345
#[2,] -0.13428822 -0.2050351 0.28374304 -0.0858400 -0.2202396
#[3,] -0.04450277 -0.3259792 0.16493099 0.2393501 -0.2634766
#[4,] 0.12454047 -0.3499992 -0.24270235 0.3411163 0.3891214
#[5,] 0.40695739 0.2034296 -0.05758283 -0.2999763 -0.4682834
Be aware that the prediction routine simply takes the existing construction information rather than reconstructing the polynomials.
Section 6: Just treat poly and predict.poly as a black box
There is rarely a need to understand everything inside. For statistical modelling it is sufficient to know that poly constructs a polynomial basis for model fitting, whose coefficients can be found in lmObject$coefficients. When making predictions, predict.poly never needs to be called by the user, since predict.lm will do it for you. In this way, it is absolutely OK to just treat poly and predict.poly as a black box.
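A short sketch of that black-box usage, with the x and y from the question and a few made-up new abscissae (x_new is hypothetical). The manual route is shown only to confirm that predict.lm rebuilds the basis at the new points from the stored construction information.

```r
x <- c(1.160, 1.143, 1.126, 1.109, 1.079, 1.053, 1.040, 1.027,
       1.015, 1.004, 0.994, 0.985, 0.977)
y <- c(1.217395, 1.604360, 2.834947, 4.585687, 8.770932, 9.996260, 9.264800,
       9.155079, 7.949278, 7.317690, 6.377519, 6.409620, 6.643426)

model <- lm(y ~ poly(x, degree = 5))
x_new <- c(1.00, 1.05, 1.10)          # hypothetical new abscissae within range(x)

## black-box route: predict.lm takes care of the basis
pred_lm <- predict(model, newdata = data.frame(x = x_new))

## manual route: rebuild the basis at x_new, then apply the fitted coefficients
basis_new <- predict(poly(x, degree = 5), x_new)
pred_manual <- drop(cbind(1, basis_new) %*% coef(model))

all.equal(unname(pred_lm), unname(pred_manual))
```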
