I have constructed a 95% confidence interval and then used replicate() to randomly generate 1000 confidence intervals. I want to count how many of the intervals contain my mean. I know that in theory about 950 of them should, but how do I get the actual count? The function I used and the mean are listed below.
z <- function(a,b,c){
error <- rnorm(a, b, c) * c / sqrt(a)
left <- b - error
right <- b + error
paste("[",round(left,2),";",round(right,2),"]")
}
set.seed(123)
replicate(1000, z(10,1,1))
Where do I go from here?
Maybe this is what you're trying to do?
This z() will return the confidence interval for the population mean of a normal distribution.
z <- function(N, mu, std, cl=95) {
alpha <- (1-cl/100)/2
# CI for population mean
sep <- std/sqrt(N)
z_s <- qnorm(1 - alpha)
pop_lower <- mu - z_s*sep
pop_upper <- mu + z_s*sep
c(lower=pop_lower, upper=pop_upper)
}
This means that if I produce a random variate mean(rnorm(20, 0, 1)), we expect its value to lie within z(20, 0, 1, 95) with probability 0.95.
To test this we can do
# specify parameters
N <- 20
mu <- 0
std <- 1
# produce a good number (10,000) of sample means
set.seed(1)
r <- replicate(1e4, mean(rnorm(N, mu, std)))
# calculate confidence interval
ci <- z(N, mu, std)
# find which are below, within and above the interval
rc <- cut(r, c(min(r), ci, max(r)), c("below", "within", "above"))
# create a proportion table
round(prop.table(table(rc))*100, 2)
# below within above
# 2.59 95.08 2.33
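To answer the original question directly (how many of the 1000 simulated intervals contain the true mean), one option is to return numeric bounds instead of a pasted string and then count. The sketch below rests on an assumption about what the asker intended: each interval is centred on the mean of a fresh sample, with the population standard deviation treated as known. The names ci_from_sample, cis and covered are hypothetical.

```r
# Hypothetical helper: a z-interval for the mean from one simulated sample,
# assuming the population standard deviation `std` is known.
ci_from_sample <- function(N, mu, std, cl = 95) {
  alpha <- (1 - cl / 100) / 2
  xbar  <- mean(rnorm(N, mu, std))           # mean of one simulated sample
  half  <- qnorm(1 - alpha) * std / sqrt(N)  # half-width of the interval
  c(lower = xbar - half, upper = xbar + half)
}

set.seed(123)
mu  <- 1
cis <- replicate(1000, ci_from_sample(10, mu, 1))  # 2 x 1000 matrix

# TRUE where the interval contains the true mean
covered <- cis["lower", ] <= mu & mu <= cis["upper", ]
sum(covered)   # number of intervals that contain mu
mean(covered)  # observed coverage proportion
```

The count will vary from seed to seed around 950; it is only expected to equal 950 on average.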
I want to get the P-value of two randomly distributed observations x and y, for example :
> set.seed(0)
> x <- rnorm(1000, 3, 2)
> y <- rnorm(2000, 4, 3)
or:
> set.seed(0)
> x <- rexp(50, 10)
> y <- rexp(100, 11)
Let's say that T = mean(x) - mean(y) is my test statistic and that H0 is mean(x) - mean(y) = 0. The p-value is then defined as: p-value = P[T > T_observed | H0 holds].
I tried doing this :
> z <- c(x,y) # if H0 holds then x and y are distributed with the same distribution
> f <- function(x) ecdf(z) # this will get the distribution of z (x and y)
Then, to calculate the p-value, I tried this:
> T <- replicate(10000, mean(sample(z,1000,TRUE))-mean(sample(z,2000,TRUE))) # this is supposed to get the null distribution of mean(x) - mean(y)
> f(quantile(T,0.05)) # calculating the p-value for a significance of 5%
Obviously this doesn't work. What am I missing?
Your intention is very good: to calculate statistical significance via bootstrap sampling (aka bootstrapping). However, mean(sample(z,1000,TRUE))-mean(sample(z,2000,TRUE)) can't work, because it takes the average of 1000 samples from z minus the average of 2000 samples from z. Both averages come from the same pooled data, so the difference will almost certainly be close to 0 regardless of the true means of x and y.
I would suggest the following:
diff <- (sample(x, size = 2000, replace = TRUE) - sample(y, size = 2000, replace = TRUE))
Here 2000 samples (with replacement) are taken from each of x and y, and their differences are calculated. You can of course increase confidence by adding replications, as you suggested. Rather than a p-value, I prefer confidence intervals (CIs), as I think they are more informative (and equivalent in statistical accuracy to p-values). The CI can then be calculated from the mean and standard error as follows:
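stderror <- sd(diff)/sqrt(length(x))
# 95% normal-approximation interval: mean +/- about 1.96 standard errors
upperCI <- mean(diff) + qnorm(0.975)*stderror
lowerCI <- mean(diff) - qnorm(0.975)*stderror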
cat(lowerCI, upperCI)
Since the CI does not include 0, the null hypothesis is rejected. Notice that the result will be close to t-test (for your normal example) CI results in R:
t <- t.test(x, y)
cat(t$conf.int)
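If a p-value is still wanted, a permutation test is one standard way to build the null distribution of mean(x) - mean(y) from the pooled data. Note that this swaps in a different technique from the bootstrap above, and that it tests the hypothesis that x and y share one distribution, which is what pooling into z assumes. The names t_obs, perm_stat and p_value are hypothetical.

```r
set.seed(0)
x <- rnorm(1000, 3, 2)
y <- rnorm(2000, 4, 3)

z     <- c(x, y)              # pooled sample, exchangeable under H0
nx    <- length(x)
t_obs <- mean(x) - mean(y)    # observed test statistic

# Reshuffle the group labels and recompute the statistic each time
perm_stat <- replicate(5000, {
  idx <- sample(length(z), nx)      # indices assigned to the "x" group
  mean(z[idx]) - mean(z[-idx])
})

# Two-sided p-value; the +1 terms keep the estimate away from exactly 0
p_value <- (sum(abs(perm_stat) >= abs(t_obs)) + 1) / (length(perm_stat) + 1)
p_value
```

For a one-sided version matching P[T > T_observed | H0], replace the comparison with perm_stat >= t_obs.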
Suppose I have some data and I fit them to a gamma distribution, how to find the interval probability for Pr(1 < x <= 1.5), where x is an out-of-sample data point?
require(fitdistrplus)
a <- c(2.44121289,1.70292449,0.30550832,0.04332383,1.0553436,0.26912546,0.43590885,0.84514809,
0.36762336,0.94935435,1.30887437,1.08761895,0.66581035,0.83108270,1.7567334,1.00241339,
0.96263021,1.67488277,0.87400413,0.34639636,1.16804671,1.4182144,1.7378907,1.7462686,
1.7427784,0.8377457,0.1428738,0.71473956,0.8458882,0.2140742,0.9663167,0.7933085,
0.0475603,1.8657773,0.18307362,1.13519144)
fit <- fitdist(a, "gamma",lower = c(0, 0))
Someone does not like my above approach, which is conditional on MLE; now let's see something unconditional. If we take direct integration, we need a triple integration: one for shape, one for rate and finally one for x. This is not appealing. I will just produce Monte Carlo estimate instead.
By the central limit theorem, MLEs are asymptotically normally distributed. Rather than relying on fitdistrplus::fitdist for standard errors, we can use MASS::fitdistr, which returns the estimated covariance matrix of the parameter estimates.
library(MASS)
fit <- fitdistr(a, "gamma", lower = c(0,0))
b <- fit$estimate
# shape rate
#1.739737 1.816134
V <- fit$vcov ## covariance
#           shape      rate
# shape 0.1423679 0.1486193
# rate  0.1486193 0.2078086
Now we would like to sample from parameter distribution and get samples of target probability.
set.seed(0)
## sample from bivariate normal with mean `b` and covariance `V`
## Cholesky method is used here
X <- matrix(rnorm(1000 * 2), 1000) ## 1000 `N(0, 1)` normal samples
R <- chol(V) ## upper triangular Cholesky factor of `V`
X <- X %*% R ## transform X under desired covariance
X <- sweep(X, 2, b, "+") ## shift each column to its own mean (plain `X + b` would recycle `b` down the rows)
## you can use `cov(X)` to check it is very close to `V`
## now samples for `Pr(1 < x < 1.5)`
p <- pgamma(1.5, X[,1], X[,2]) - pgamma(1, X[,1], X[,2])
We can make a histogram of p (and maybe do a density estimation if you want):
hist(p, prob = TRUE)
A natural point estimate is the sample mean of p:
mean(p)
# about 0.19
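As an alternative to the hand-rolled Cholesky step, the same parameter draws can be sketched with MASS::mvrnorm, which also makes it easy to report an interval for the probability instead of only its mean. The estimates and covariance below are copied (rounded) from the output above. Dropping non-positive draws is my own assumption for keeping pgamma well defined, not part of the original answer.

```r
library(MASS)  # for mvrnorm()

# Estimates and covariance copied from the fitdistr() output above
b <- c(shape = 1.739737, rate = 1.816134)
V <- matrix(c(0.1423679, 0.1486193,
              0.1486193, 0.2078086), nrow = 2,
            dimnames = list(names(b), names(b)))

set.seed(0)
X <- mvrnorm(1000, mu = b, Sigma = V)  # 1000 x 2 matrix of parameter draws

# The normal approximation can, rarely, give non-positive values; drop them
X <- X[X[, 1] > 0 & X[, 2] > 0, , drop = FALSE]

p <- pgamma(1.5, X[, 1], X[, 2]) - pgamma(1, X[, 1], X[, 2])
mean(p)                       # point estimate
quantile(p, c(0.025, 0.975))  # rough 95% interval for the probability
```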
Here is an example that uses MCMC techniques and a Bayesian mode of inference to estimate the posterior probability that a new observation falls in the interval (1, 1.5]. This is an unconditional estimate, as opposed to the conditional estimate obtained by integrating the gamma distribution with maximum-likelihood parameter estimates.
This code requires that JAGS be installed on your computer (free and easy to install).
library(rjags)
a <- c(2.44121289,1.70292449,0.30550832,0.04332383,1.0553436,0.26912546,0.43590885,0.84514809,
0.36762336,0.94935435,1.30887437,1.08761895,0.66581035,0.83108270,1.7567334,1.00241339,
0.96263021,1.67488277,0.87400413,0.34639636,1.16804671,1.4182144,1.7378907,1.7462686,
1.7427784,0.8377457,0.1428738,0.71473956,0.8458882,0.2140742,0.9663167,0.7933085,
0.0475603,1.8657773,0.18307362,1.13519144)
# Specify the model in JAGS language using diffuse priors for shape and rate
sink("GammaModel.txt")
cat("model{
# Priors
shape ~ dgamma(.001,.001)
rate ~ dgamma(.001,.001)
# Model structure
for(i in 1:n){
a[i] ~ dgamma(shape, rate)
}
}
", fill=TRUE)
sink()
jags.data <- list(a=a, n=length(a))
# Give overdispersed initial values (not important for this simple model, but very important if running complicated models where you need to check convergence by monitoring multiple chains)
inits <- function(){list(shape=runif(1,0,10), rate=runif(1,0,10))}
# Specify which parameters to monitor
params <- c("shape", "rate")
# Set-up for MCMC run
nc <- 1 # number of chains
n.adapt <-1000 # number of adaptation steps
n.burn <- 1000 # number of burn-in steps
n.iter <- 500000 # number of posterior samples
thin <- 10 # thinning of posterior samples
# Running the model
gamma_mod <- jags.model('GammaModel.txt', data = jags.data, inits=inits, n.chains=nc, n.adapt=n.adapt)
update(gamma_mod, n.burn)
gamma_samples <- coda.samples(gamma_mod,params,n.iter=n.iter, thin=thin)
# Summarize the result
summary(gamma_samples)
# Draw one posterior predictive value of x per retained posterior sample.
# Columns are selected by name so the result does not depend on their order.
post <- as.matrix(gamma_samples)
x <- rgamma(nrow(post), shape = post[, "shape"], rate = post[, "rate"])
# Proportion of predictive draws that fall in the desired range
mean(x > 1 & x <= 1.5)
Answer:
Pr(1 < x <= 1.5) = 0.194
This is pretty close to the conditional estimate, but that is not guaranteed to be the case in general.
You can just use pgamma with estimated parameters in fit.
b <- fit$estimate
# shape rate
#1.739679 1.815995
pgamma(1.5, b[1], b[2]) - pgamma(1, b[1], b[2])
# [1] 0.1896032
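As a sanity check on the pgamma difference, the same probability can be obtained by numerically integrating the fitted density over the interval. This is only a sketch: the shape and rate values are copied (rounded) from the output above, and p_cdf and p_int are hypothetical names.

```r
# Parameter values copied from the fitted output above (rounded)
shape <- 1.739679
rate  <- 1.815995

# Probability via the CDF
p_cdf <- pgamma(1.5, shape, rate) - pgamma(1, shape, rate)

# Same probability by numerically integrating the density over (1, 1.5]
p_int <- integrate(dgamma, lower = 1, upper = 1.5,
                   shape = shape, rate = rate)$value

c(cdf = p_cdf, integral = p_int)
```

The two numbers should agree to within the tolerance of integrate().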
Thanks. But how about P(x > 2)?
Check out the lower.tail argument:
pgamma(q, shape, rate = 1, scale = 1/rate, lower.tail = TRUE, log.p = FALSE)
By default, pgamma(q) evaluates Pr(x <= q). Setting lower.tail = FALSE gives Pr(x > q). So you can do:
pgamma(2, b[1], b[2], lower.tail = FALSE)
# [1] 0.08935687
Or you can also use
1 - pgamma(2, b[1], b[2])
# [1] 0.08935687
What I have is a vector with different areas under the ROC curve (from different studies), e.g.,
a <- c(.91, .85, .76, .89)
I also have the absolute number of participants in each study, e.g.,
n <- c(50, 34, 26, 47)
I calculated the weighted average for the areas with
weighted.mean(a, n)
Is there a way in R to also calculate the 95% confidence intervals of the weighted mean, based on the information I have? I looked into pROC, but as far as I understood it, there you need the raw data for each ROC curve (which I don't have). I would be very thankful for any suggestions!
weighted.ttest.ci <- function(x, weights, conf.level = 0.95) {
require(Hmisc)
nx <- length(x)
df <- nx - 1
vx <- wtd.var(x, weights, normwt = TRUE) ## From Hmisc
mx <- weighted.mean(x, weights)
stderr <- sqrt(vx/nx)
tstat <- mx/stderr ## not mx - mu
alpha <- 1 - conf.level
cint <- qt(1 - alpha/2, df)
cint <- tstat + c(-cint, cint)
cint * stderr
}
> weighted.ttest.ci(a,n)
[1] 0.7696 0.9627
Suppose we have a random sample of size n = 8 from a lognormal distribution with parameters mu and sigma. Since it is a small sample from a non-normal population, I will be using the t confidence interval. I ran a simulation to determine the true (simulated) confidence level of a nominal 90% t-CI with mu = 1 and sigma = 1.5.
My problem is that my code below follows a NORMAL distribution and it needs to be a lognormal distribution.
I know that rnorm has to become rlnorm so that the random variables come from the log distribution. But I need to change what mu and sigma are. Mu and sigma in normal distribution aren't the same in a log distribution.
The mean of the lognormal distribution is $\exp(\mu + \sigma^2/2)$, and its variance is $\exp(2\mu + 2\sigma^2) - \exp(2\mu + \sigma^2)$.
I'm just confused on how I can incorporate these two equations into my code.
BTW- if you didn't already guess, I am VERY new to R. Any help would be appreciated!
MC <- 10000 # Number of samples to simulate
result <- c(1:MC)
mu <- 1
sigma <- 1.5
n <- 8; # Sample size
alpha <- 0.1 # the nominal confidence level is 100(1-alpha) percent
t_criticalValue <- qt(p=(1-alpha/2), df=(n-1))
for(i in 1:MC){
mySample <- rlnorm(n=n, mean=mu, sd=sigma)
lowerCL <- mean(mySample)-t_criticalValue*sd(mySample)/sqrt(n)
upperCL <- mean(mySample)+t_criticalValue*sd(mySample)/sqrt(n)
result[i] <- ((lowerCL < mu) & (mu < upperCL))
}
SimulatedConfidenceLevel <- mean(result)
EDIT: So I tried replacing mu and sd with their respective formulas,
mu = $\exp(\mu + \sigma^2/2)$
sigma = $\exp(2\mu + \sigma^2)(\exp(\sigma^2) - 1)$
and I got a SimulatedConfidenceLevel of 5000.
Here's some reproducible sample data:
(x <- rlnorm(8, 1, 1.5))
## [1] 3.5415832 0.3563604 0.5052436 3.5703968 7.3696985 0.7737094 12.9768734 35.9143985
Your definition of the critical value was correct:
n <- length(x)
alpha <- 0.1
t_critical_value <- qt(1 - alpha / 2, n - 1)
There's a utility function in the ggplot2 plotting package that calculates means and standard errors. In this case, you can apply it to the log of your data to find mu and its confidence interval.
library(ggplot2)
mean_se(log(x), t_critical_value)
## y ymin ymax
## 1 1.088481 -0.006944755 2.183907
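For the simulation in the question, one likely culprit is that the t-interval for the mean of the raw data is compared with mu, which is the mean of log(x) and not of x. A sketch under that assumption compares the interval with the true lognormal mean exp(mu + sigma^2/2) instead. The names true_mean and covers are hypothetical.

```r
set.seed(1)
MC     <- 10000   # number of simulated samples
n      <- 8       # sample size
mu     <- 1       # mean of log(x)
sigma  <- 1.5     # sd of log(x)
alpha  <- 0.1     # nominal level is 100 * (1 - alpha) percent
t_crit <- qt(1 - alpha / 2, df = n - 1)

true_mean <- exp(mu + sigma^2 / 2)   # mean of the lognormal itself

covers <- replicate(MC, {
  s    <- rlnorm(n, meanlog = mu, sdlog = sigma)
  half <- t_crit * sd(s) / sqrt(n)
  (mean(s) - half < true_mean) && (true_mean < mean(s) + half)
})

mean(covers)   # simulated confidence level, a proportion between 0 and 1
```

Because result is a logical vector, its mean is a proportion. With data this skewed and n this small, the simulated level can fall well short of the nominal 90%.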