So, just a touch of backstory: I've been learning biostatistics at university for the past 4-5 months, with 6 months of biomathematics before that. I only started diving deep into programming around 5 days ago.
I've been trying to reimplement t.test() with my own function.
test2 = function(t, u){
  T = (mean(t) - u) / (sd(t) / sqrt(length(t)))
  t1 = round(T, digits = 5)
  df = length(t)
  cat(paste('t - value =', t1,
            '\n', 'df =', df - 1,
            '\n', 'Alternative hypothesis: the sample mean is not equal to the hypothetical mean'))
}
I tried searching for the p-value formula; I found one, but when I used it my value was different from the one given by t.test().
The t-value and the df do match t.test().
I highly appreciate any help, thank you.
P.s.: The last line is just the alternative-hypothesis message (originally written in Hungarian).
The p-value can be derived from the cumulative distribution function of the t distribution, pt(). Using this, and making the notation more conventional with sample x and population mean mu, we can use something like:
test2 <- function(x, u){
  t <- (mean(x) - u) / (sd(x) / sqrt(length(x)))
  df <- length(x) - 1
  # -abs(t) makes the two-sided p-value correct for negative t as well
  cat('t-value =', t, ', df =', df, ', p =', 2 * pt(-abs(t), df = df), '\n')
}
set.seed(123) # remove this for other random values
## random sample
x <- rnorm(10, mean=5.5)
## population mean
mu <- 5
## own function
test2(x, mu)
## one sample t-test from R
t.test(x, mu=mu)
For our own test2 we get:
t-value = 1.905175 , df = 9, p = 0.08914715
and for R's t.test
One Sample t-test
data: x
t = 1.9052, df = 9, p-value = 0.08915
alternative hypothesis: true mean is not equal to 5
95 percent confidence interval:
4.892330 6.256922
sample estimates:
mean of x
5.574626
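If you want to compare the values programmatically rather than by eye, note that t.test() returns an "htest" object whose components can be pulled out directly (a small illustrative addition, not part of the original answer):
# extract the same quantities from the htest object returned by t.test()
res <- t.test(x, mu = mu)
res$statistic   # t-value
res$parameter   # df
res$p.value     # p-value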
The definitive source for what R is doing is the source code itself. If you look at the source of stats:::t.test.default (which you can see by typing stats:::t.test.default into the console, without parentheses, and hitting enter), you'll see that for a one-sample test like the one you're doing above, the relevant computation is:
nx <- length(x)
mx <- mean(x)
vx <- var(x)
df <- nx - 1
stderr <- sqrt(vx/nx)
tstat <- (mx - mu)/stderr
if (alternative == "less") {
    pval <- pt(tstat, df)
}
else if (alternative == "greater") {
    pval <- pt(tstat, df, lower.tail = FALSE)
}
else {
    pval <- 2 * pt(-abs(tstat), df)
}
These are the relevant pieces (there's a lot more code in there, too).
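Pulling those pieces together into a self-contained function (a minimal sketch; the name my_t_test and the returned list are my own choices, not R's implementation) could look like:
my_t_test <- function(x, mu = 0, alternative = c("two.sided", "less", "greater")) {
  # hypothetical helper mirroring the one-sample branch of stats:::t.test.default
  alternative <- match.arg(alternative)
  nx <- length(x)
  df <- nx - 1
  stderr <- sqrt(var(x)/nx)
  tstat <- (mean(x) - mu)/stderr
  pval <- switch(alternative,
                 less      = pt(tstat, df),
                 greater   = pt(tstat, df, lower.tail = FALSE),
                 two.sided = 2 * pt(-abs(tstat), df))
  cat('t =', tstat, ', df =', df, ', p-value =', pval, '\n')
  invisible(list(statistic = tstat, parameter = df, p.value = pval))
}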
The regression model here is y_i = 10 + x_i + u_i, with the error term normally distributed as u_i ~ N(0, 0.5*sqrt(x_i)), i.e. its variance grows with x_i.
The rnorm() function supports vectorised arguments, so you can generate u_i given x_i with the following lines:
x <- 1:10
# standard deviation varies with x: Var(u_i) = 0.5 * sqrt(x_i)
u <- rnorm(length(x), mean = 0, sd = sqrt(0.5 * sqrt(x)))
Then
y <- 10 + x + u
To validate:
set.seed(2021)
validation.u <- NULL
for (i in 1:1000){
  # same standard deviation as above: sqrt(0.5 * sqrt(x))
  u_i <- rnorm(length(x), mean = 0, sd = sqrt(0.5 * sqrt(x)))
  validation.u <- rbind(validation.u, u_i)
}
colMeans(validation.u)
> [1] 0.04053143 0.02626776 -0.01915211 0.02319424 0.05187192 -0.06950472 0.01335739 0.03787621 0.01348861 0.03323902
diag(cov(validation.u))
> [1] 0.5277132 0.6585050 0.8586679 1.1081190 1.1200815 1.2266432 1.2553346 1.4172857 1.5748265 1.6788966
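As a further check (my addition, not part of the original answer), fitting an ordinary linear model to one simulated data set should roughly recover the intercept 10 and slope 1:
set.seed(2021)
x <- 1:10
u <- rnorm(length(x), mean = 0, sd = sqrt(0.5 * sqrt(x)))
y <- 10 + x + u
coef(lm(y ~ x))  # should be roughly c(10, 1), up to sampling noise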
Searching through Stack Overflow for similar questions, I notice that this has been asked several times but hasn't really been properly answered. Perhaps with help from other users this post can become a helpful guide to programming a numerical estimate of the parameters of a multivariate normal distribution.
I know, I know! The closed-form solutions are available and trivial to implement. In my case I am interested in modifying the likelihood function for a specific purpose, and I don't expect an exact analytic solution, so this is a test case to check the procedure.
So here is my attempt; please comment, especially if I am missing opportunities for optimization. Note that I'm not a statistician, so I'd appreciate any pointers.
ll_multN <- function(theta, X) {
  # theta = c(mu, diag(Sigma), Sigma[upper.tri(Sigma)])
  # X is an n x k dataset
  # log-likelihood: l = -(n*k/2)*log(2*pi) - (n/2)*log(det(Sigma)) - (1/2)*sum_i( t(X_i-mu) %*% Sigma^-1 %*% (X_i-mu) )
  # the summation over i is performed using an apply call
  n <- nrow(X)
  k <- ncol(X)
  # def mu
  mu.vec <- theta[1:k]
  # def Sigma
  Sigma.diag <- theta[(k+1):(2*k)]
  Sigma.offd <- theta[(2*k+1):length(theta)]
  Sigma <- matrix(NA, k, k)
  Sigma[upper.tri(Sigma)] <- Sigma.offd
  Sigma <- t(Sigma)
  Sigma[upper.tri(Sigma)] <- Sigma.offd
  diag(Sigma) <- Sigma.diag
  # compute summation
  sum_i <- sum(apply(X, 1, function(x) (matrix(x,1,k)-mu.vec) %*% solve(Sigma) %*% t(matrix(x,1,k)-mu.vec)))
  # compute log likelihood
  logl <- -.5*n*k*log(2*pi) - .5*n*log(det(Sigma))
  logl <- logl - .5*sum_i
  # return the negative log-likelihood, since optim minimises
  return(-logl)
}
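One possible speed-up (my suggestion, not part of the original code): solve(Sigma) is re-evaluated for every row inside the apply() call; inverting once and letting the built-in mahalanobis() compute all the quadratic forms gives the same sum_i. A hedged sketch of that replacement for the summation step:
# possible replacement for the "compute summation" step above
Sigma.inv <- solve(Sigma)  # invert once instead of once per observation
sum_i <- sum(mahalanobis(X, center = mu.vec, cov = Sigma.inv, inverted = TRUE))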
The simulated dataset is generated using the rmvnorm() function from the "mvtnorm" package. The random positive-definite covariance matrix is generated with the helper function Posdef() (taken from here: https://stat.ethz.ch/pipermail/r-help/2008-February/153708).
library(mvtnorm)
Posdef <- function (n, ev = runif(n, 0, 5)) {
  # generates a random positive definite covariance matrix
  Z <- matrix(ncol=n, rnorm(n^2))
  decomp <- qr(Z)
  Q <- qr.Q(decomp)
  R <- qr.R(decomp)
  d <- diag(R)
  ph <- d / abs(d)
  O <- Q %*% diag(ph)
  Z <- t(O) %*% diag(ev) %*% O
  return(Z)
}
set.seed(2)
n <- 1000 # number of data points
k <- 3 # number of variables
mu.tru <- sample(0:3, k, replace=T) # random mean vector
Sigma.tru <- Posdef(k) # random covariance matrix
eigen(Sigma.tru)$val # check positive def (all lambda > 0)
# Generate simulated dataset
X <- rmvnorm(n, mean=mu.tru, sigma=Sigma.tru)
# initial parameter values
pars.init <- c(mu=rep(0,k), sig_ii=rep(1,k), sig_ij=rep(0, k*(k-1)/2))
# limits for optimization algorithm
eps <- .Machine$double.eps # get a small value for bounding the parameter space to avoid things such as log(0).
# note: there are k*(k-1)/2 off-diagonal parameters, so the bounds use that count
lower.bound <- c(rep(-Inf, k),           # bound on mu
                 rep(eps, k),            # bound on sigma_ii
                 rep(-Inf, k*(k-1)/2))   # bound on sigma_ij, i != j
upper.bound <- c(rep(Inf, k),            # bound on mu
                 rep(100, k),            # bound on sigma_ii
                 rep(100, k*(k-1)/2))    # bound on sigma_ij, i != j
system.time(
o <- optim(pars.init,
ll_multN, X=X, method="L-BFGS-B",
lower = lower.bound,
upper = upper.bound)
)
plot(x=c(mu.tru,diag(Sigma.tru),Sigma.tru[upper.tri(Sigma.tru)]),
y=o$par,
xlab="Parameter",
ylab="Estimate",
pch=20)
abline(c(0,1), col="red", lty=2)
This currently runs on my laptop in
user system elapsed
47.852 24.014 24.611
and gives this graphical output:
[Plot: estimated mean and variance parameters vs. their true values]
In particular any advice on limit setting or algorithm choice would be much appreciated.
Thanks
I am trying to solve for the parameters of a gamma distribution that is convolved with both normal and lognormal distributions. I can experimentally derive parameters for both the normal and lognormal components, hence, I just want to solve for the gamma params.
I have attempted 3 approaches to this problem:
1) generating convolved random datasets (i.e. rnorm()+rlnorm()+rgamma()) and using least-squares regression on the linear- or log-binned histograms of the data (not shown, but was very biased by RNG and didn't optimize well at all.)
2) "brute-force" numerical integration of the convolving functions (example code #1)
3) numerical integration approaches w/ the distr package. (example code #2)
I have had limited success with all three approaches. Importantly, these approaches seem to work well for "nominal" values of the gamma parameters, but they all begin to fail when k (shape) is low and theta (scale) is high, which is where my experimental data resides. Please find the examples below.
Straight-up numerical integration
# make the functions
f.N <- function(n) dnorm(n, N[1], N[2])
f.L <- function(l) dlnorm(l, L[1], L[2])
f.G <- function(g) dgamma(g, G[1], scale=G[2])
# make convolved functions
f.Z <- function(z) integrate(function(x,z) f.L(z-x)*f.N(x), -Inf, Inf, z)$value # L+N
f.Z <- Vectorize(f.Z)
f.Z1 <- function(z) integrate(function(x,z) f.G(z-x)*f.Z(x), -Inf, Inf, z)$value # G+(L+N)
f.Z1 <- Vectorize(f.Z1)
# params of Norm, Lnorm, and Gamma
N <- c(0,5)
L <- c(2.5,.5)
G <- c(2,7) # this distribution is the one we ultimately want to solve for.
# G <- c(.5,10) # 0<k<1
# G <- c(.25,5e4) # ballpark params of experimental data
# generate some data
set.seed(1)
rN <- rnorm(1e4, N[1], N[2])
rL <- rlnorm(1e4, L[1], L[2])
rG <- rgamma(1e4, G[1], scale=G[2])
Z <- rN + rL
Z1 <- rN + rL + rG
# check the fit
hist(Z,freq=F,breaks=100, xlim=c(-10,50), col=rgb(0,0,1,.25))
hist(Z1,freq=F,breaks=100, xlim=c(-10,50), col=rgb(1,0,0,.25), add=T)
z <- seq(-10,50,1)
lines(z,f.Z(z),lty=2,col="blue", lwd=2) # looks great... convolution performs as expected.
lines(z,f.Z1(z),lty=2,col="red", lwd=2) # this works perfectly so long as k(shape)>=1
# I'm guessing the failure to compute when shape 0 < k < 1 is due to
# numerical integration problems, but I don't know how to fix it.
integrate(dgamma, -Inf, Inf, shape=1, scale=1) # ==1
integrate(dgamma, 0, Inf, shape=1, scale=1) # ==1
integrate(dgamma, -Inf, Inf, shape=.5, scale=1) # !=1
integrate(dgamma, 0, Inf, shape=.5, scale=1) # != 1
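# A possible workaround (editor's suggestion, not in the original): for shape < 1 the
# gamma density has an integrable singularity at 0, so splitting the range lets the
# finite-interval routine (which copes with endpoint singularities) handle that part:
integrate(dgamma, 0, 1, shape=.5, scale=1)$value +
  integrate(dgamma, 1, Inf, shape=.5, scale=1)$value  # should be close to 1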
# Let's try to estimate gamma anyway, supposing k >= 1
optimFUN <- function(par, N, L) {
  print(par)
  # redefine the global f.G so that f.Z1 picks up the candidate gamma parameters
  f.G <<- function(g) dgamma(g, par[1], scale = par[2])
  -sum(log(f.Z1(Z1[1:4e2])))
}
fitresult <- optim(c(1.6, 5), optimFUN, N = N, L = L)
par <- fitresult$par
f.G <- function(g) dgamma(g, par[1], scale = par[2])
lines(z, f.Z1(z), lty = 2, col = "green3", lwd = 2) # not so great... likely better w/ more data,
# but it is SUPER slow and I observe large step sizes.
Attempting convolution via the distr package
library(distr)
# params of Norm, Lnorm, and Gamma
N <- c(0,5)
L <- c(2.5,.5)
G <- c(2,7) # this distribution is the one we ultimately want to solve for.
# G <- c(.5,10) # 0<k<1
# G <- c(.25,5e4) # ballpark params of experimental data
# make the distributions and their convolutions
dN <- Norm(N[1], N[2])
dL <- Lnorm(L[1], L[2])
dG <- Gammad(G[1], G[2])
d.NL <- d(convpow(dN+dL,1))
d.NLG <- d(convpow(dN+dL+dG,1)) # for large values of theta, no matter how I change
# getdistrOption("DefaultNrFFTGridPointsExponent"), grid size is always wrong.
# Generate some data
set.seed(1)
rN <- r(dN)(1e4)
rL <- r(dL)(1e4)
rG <- r(dG)(1e4)
r.NL <- rN + rL
r.NLG <- rN + rL + rG
# check the fit
hist(r.NL, freq=F, breaks=100, xlim=c(-10,50), col=rgb(0,0,1,.25))
hist(r.NLG, freq=F, breaks=100, xlim=c(-10,50), col=rgb(1,0,0,.25), add=T)
z <- seq(-10,50,1)
lines(z,d.NL(z), lty=2, col="blue", lwd=2) # looks great... convolution performs as expected.
lines(z,d.NLG(z), lty=2, col="red", lwd=2) # this appears to work perfectly
# for most values of K and low values of theta
# this is looking a lot more promising... how about estimating gamma params?
optimFUN <- function(par, dN, dL) {
  tG <- Gammad(par[1], par[2])
  d.NLG <- d(convpow(dN + dL + tG, 1))
  p <- d.NLG(r.NLG)
  p[p == 0] <- 1e-15 # because sometimes very low probabilities evaluate to 0...
                     # ...and logs don't like that.
  -sum(log(p))
}
fitresult <- optim(c(1,1e4), optimFUN, dN=dN, dL=dL)
fdG <- Gammad(fitresult$par[1], fitresult$par[2])
fd.NLG <- d(convpow(dN+dL+fdG,1))
lines(z,fd.NLG(z), lty=2, col="green3", lwd=2) ## this works perfectly when ~k>1 & ~theta<100... but throws
## "Error in validityMethod(object) : shape has to be positive" when k decreases and/or theta increases
## (boundary subject to RNG).
Can I speed up the integration in example 1? Can I increase the grid size in example 2 (distr package)? How can I address the k < 1 problem? Can I rescale the data in a way that will better facilitate evaluation at high theta values?
Is there a better way all-together?
Help!
Well, convolution of a function with a Gaussian kernel calls for Gauss-Hermite quadrature. In R it is implemented in a dedicated package: https://cran.r-project.org/web/packages/gaussquad/gaussquad.pdf
UPDATE
For convolution with a gamma distribution this package might be useful as well, via Gauss-Laguerre quadrature.
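A minimal sketch of the Gauss-Laguerre analogue (my addition; the function names laguerre.quadrature.rules / laguerre.quadrature are assumed to mirror the Hermite ones used below, so treat this as unverified):
library(gaussquad)
n.gl <- 20  # modest order for the demo
rule.gl <- laguerre.quadrature.rules(n.gl)[[n.gl]]
# the rule integrates f(x) * exp(-x) over (0, Inf); with f(x) = 1 the
# result should be ~1, the total mass of the Exp(1) / Gamma(1,1) density
laguerre.quadrature(function(x) rep(1, length(x)), rule.gl)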
UPDATE II
Here is quick code to convolve a Gaussian with a lognormal; hopefully it doesn't have too many bugs, and it prints a reasonable-looking graph.
library(gaussquad)
n.quad <- 170 # integration order
# get the particular weights/abscissas as data frame with 2 observables and n.quad observations
rule <- ghermite.h.quadrature.rules(n.quad, mu = 0.0)[[n.quad]]
# test function - integrate f(x) = 1 against the weight exp(-x^2) from -Inf to Inf
# should get sqrt(pi) as an answer
f <- function(x) {
  1.0
}
q <- ghermite.h.quadrature(f, rule)
print(q - sqrt(pi))
# convolution of lognormal with gaussian
# because of the G-H rules, we have to make our own function
# for simplicity, sigmas are one and mus are zero
sqrt2 <- sqrt(2.0)
c.LG <- function(z) {
  #print(z)
  f.LG <- function(x) {
    t <- (z - x*sqrt2)
    q <- 0.0
    if (t > 0.0) {
      l <- log(t)
      q <- exp( - 0.5*l*l ) / t
    }
    q
  }
  ghermite.h.quadrature(Vectorize(f.LG), rule) / (pi*sqrt2)
}
library(ggplot2)
p <- ggplot(data = data.frame(x = 0), mapping = aes(x = x))
p <- p + stat_function(fun = Vectorize(c.LG))
p <- p + xlim(-1.0, 5.0)
print(p)
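As a quick cross-check (my addition), the quadrature result can be compared against the brute-force integrate() convolution from the question, with both mus set to 0 and both sigmas to 1 to match c.LG:
f.N1 <- function(n) dnorm(n, 0, 1)
f.L1 <- function(l) dlnorm(l, 0, 1)
f.Z0 <- Vectorize(function(z) integrate(function(x) f.L1(z - x) * f.N1(x), -Inf, Inf)$value)
c(c.LG(2.0), f.Z0(2.0))  # the two density values should agree closely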
I have a time-series which I need to fit onto an AR (auto-regression) model.
The AR model has the form:
x(t) = a0 + a1*x(t-1) + a2*x(t-2) + ... + aq*x(t-q) + noise.
I have two constraints:
Find the best AR fit when lag.max = 50.
Sum of all coefficients a0 + a1 + ... + aq = 1
I wrote the below code:
require(FitAR)
data(lynx) # my real data comes from the stock market.
z <- -log(lynx)
#find best model
step <- SelectModel(z, ARModel = "AR" ,lag.max = 50, Criterion = "AIC",Best=10)
summary(step) # display results
# fit the model and get coefficients
arfit <- ar(z, order.max = ceiling(mean(step[,1])), aic = FALSE)
# check if the sum of coefficients is 1
sum(arfit$ar)
[1] 0.5784978
My question is, how to add the constraint: sum of all coefficients = 1?
I looked at this question, but I don't see how to apply it here.
**UPDATE**
I think I managed to solve my question as follows.
library(quadprog)
coeff <- arfit$ar
y <- 0
for (i in 1:length(coeff)) {
  y <- y + coeff[i] * c(z[(i+1):length(z)], rep(0, i))
  if (i == 1) X <- c(z[2:length(z)], 0)
  else X <- cbind(X, c(z[(i+1):length(z)], rep(0, i)))
}
Dmat <- t(X) %*% X
s <- solve.QP(Dmat, t(y) %*% X, matrix(1, nrow = length(coeff), ncol = 1), 1, meq = 1)
s$solution
# The coefficients should sum up to 1
sum(s$solution)