R: Calculate a P-value of a random distribution

I want to get the P-value for two random samples x and y, for example:
> set.seed(0)
> x <- rnorm(1000, 3, 2)
> y <- rnorm(2000, 4, 3)
or:
> set.seed(0)
> x <- rexp(50, 10)
> y <- rexp(100, 11)
Let's say my test statistic is T = mean(x) - mean(y), and H0 is that the true difference is zero. The p-value is then defined as: p-value = P[T > T_observed | H0 holds].
I tried doing this :
> z <- c(x,y) # if H0 holds then x and y are distributed with the same distribution
> f <- function(x) ecdf(z) # this will get the distribution of z (x and y)
Then, to calculate the p-value, I tried this:
> # this is supposed to get the null distribution of mean(x) - mean(y)
> T <- replicate(10000, mean(sample(z,1000,TRUE)) - mean(sample(z,2000,TRUE)))
> f(quantile(T,0.05)) # calculating the p-value for a significance of 5%
Obviously this doesn't seem to work. What am I missing?

Your intention is very good -- to assess statistical significance via bootstrap sampling (aka bootstrapping). However, mean(sample(z,1000,TRUE)) - mean(sample(z,2000,TRUE)) can't work as a test statistic: it takes the difference between an average of 1000 resamples of z and an average of 2000 resamples of z, which will almost certainly be quite close to 0 regardless of the true means of x and y.
I would suggest the following:
diff <- (sample(x, size = 2000, replace = TRUE) - sample(y, size = 2000, replace = TRUE))
2000 samples (with replacement) are taken from both x and y, and the differences are calculated. Of course you can also increase confidence by adding replications, as you suggested. As opposed to p-values, I prefer confidence intervals (CIs), as I think they are more informative (and statistically equivalent to p-values). The CI can then be calculated from the mean and standard error of diff:
stderror <- sd(diff)/sqrt(length(diff))  # standard error of mean(diff)
# note: mean(diff) +/- one standard error is roughly a 68% interval;
# multiply stderror by qnorm(0.975) for an approximate 95% CI
upperCI <- mean(diff) + stderror
lowerCI <- mean(diff) - stderror
cat(lowerCI, upperCI)
Since the CI does not include 0, the null hypothesis is rejected. Notice that the result will be close to the t-test CI (for your normal example) in R:
t <- t.test(x, y)
cat(t$conf.int)
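For completeness, here is a minimal end-to-end sketch of this approach (my own consolidation; it reuses the asker's normal example and the common percentile method for the CI rather than the standard-error method above):
set.seed(0)
x <- rnorm(1000, 3, 2)
y <- rnorm(2000, 4, 3)
# bootstrap the difference in means with 10,000 replications
diffs <- replicate(10000, mean(sample(x, replace = TRUE)) - mean(sample(y, replace = TRUE)))
# percentile 95% CI for mean(x) - mean(y)
quantile(diffs, c(0.025, 0.975))
# compare with the t-test CI
t.test(x, y)$conf.int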

Testing confidence intervals in R

I have constructed a 95% confidence interval and then used replicate() to randomly generate 1000 confidence intervals. I want to measure how many of the intervals contain my mean. In theory it should be 950 of them, but how do I get a definite answer? The function I used and the parameters are listed below.
z <- function(a, b, c){
  error <- rnorm(a, b, c) * c / sqrt(a)
  left <- b - error
  right <- b + error  # 'j' in the original is presumably a typo for 'b'
  paste("[", round(left, 2), ";", round(right, 2), "]")
}
set.seed(123)
replicate(1000, z(10,1,1))
Where do I go from here?
Maybe this is what you're trying to do?
This z() will return the confidence interval for the population mean of a normal distribution.
z <- function(N, mu, std, cl = 95) {
  alpha <- (1 - cl/100)/2
  # CI for the population mean
  sep <- std/sqrt(N)       # standard error of the mean
  z_s <- qnorm(1 - alpha)  # normal quantile
  pop_lower <- mu - z_s*sep
  pop_upper <- mu + z_s*sep
  c(lower = pop_lower, upper = pop_upper)
}
Meaning that if I compute a sample mean, mean(rnorm(20, 0, 1)), we expect its value to lie within z(20, 0, 1, 95) with probability 0.95.
To test this we can do
# specify parameters
N <- 20
mu <- 0
std <- 1
# produce a good number (10,000) of sample means
set.seed(1)
r <- replicate(1e4, mean(rnorm(N, mu, std)))
# calculate confidence interval
ci <- z(N, mu, std)
# find which are below, within and above the interval
rc <- cut(r, c(min(r), ci, max(r)), c("below", "within", "above"))
# create a proportion table
round(prop.table(table(rc))*100, 2)
# below within above
# 2.59 95.08 2.33
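To answer the original question directly -- counting how many of 1000 sample-based intervals contain the true mean -- a sketch along these lines should work (my own; it uses the known-sigma z-interval, consistent with z() above):
set.seed(123)
N <- 10; mu <- 1; std <- 1
covers <- replicate(1000, {
  s <- rnorm(N, mu, std)
  half <- qnorm(0.975) * std / sqrt(N)  # half-width of the 95% z-interval
  (mean(s) - half < mu) && (mu < mean(s) + half)
})
sum(covers)  # should be close to 950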

R: Bootstrapping max value from a vector

I have a data frame df with a column X of normally distributed values along 1,000,000 rows. The max value in X is 0.8. Using R (and perhaps the "boot" package), I would like to do bootstrapping with replacement to estimate how unlikely it is to get max(df$X) = 0.8 from my data. For this, I could take n bootstrap samples from X and calculate the max value of each sample. Then I can take the standard deviation of those max values and see how far 0.8 is from it. Does anyone know how to do this bootstrapping in R? Any suggestion is welcome!
Bootstrapping from x, where x is a normal random variable: a statistic function needs to be provided, which requires at least data and indices as its arguments; check the R documentation of the boot package for more details.
The max_x function below checks whether the max of a bootstrapped sample equals max(x). Note that the test data (x) considered in the code below has a different maximum value, but the conceptual framework remains the same:
library(boot)
set.seed(101)
x <- rnorm(1000, mean = 0.4, sd = 0.2)  # normally distributed test data
max_x <- function(data, indices){
  m <- max(data[indices])
  if (m == max(x)) {
    return(1)
  } else {
    return(0)
  }
}
results <- boot(data = x, statistic = max_x, R = 1000) # 1000 replications
mean(results$t == 1) # probability of max getting sampled
# 0.618
results
# ORDINARY NONPARAMETRIC BOOTSTRAP
# Call:
# boot(data = x, statistic = max_x, R = 1000)
# Bootstrap Statistics :
# original bias std. error
# t1* 1 -0.382 0.4861196
plot(results)
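As a sanity check (my addition), the probability that the sample maximum appears at least once in a bootstrap resample has a closed form, 1 - (1 - 1/n)^n, which approaches 1 - 1/e as n grows -- consistent with the 0.618 estimate above:
n <- 1000
1 - (1 - 1/n)^n
# [1] 0.6323046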

How to find interval probability for a given distribution?

Suppose I have some data and I fit them to a gamma distribution, how to find the interval probability for Pr(1 < x <= 1.5), where x is an out-of-sample data point?
require(fitdistrplus)
a <- c(2.44121289,1.70292449,0.30550832,0.04332383,1.0553436,0.26912546,0.43590885,0.84514809,
0.36762336,0.94935435,1.30887437,1.08761895,0.66581035,0.83108270,1.7567334,1.00241339,
0.96263021,1.67488277,0.87400413,0.34639636,1.16804671,1.4182144,1.7378907,1.7462686,
1.7427784,0.8377457,0.1428738,0.71473956,0.8458882,0.2140742,0.9663167,0.7933085,
0.0475603,1.8657773,0.18307362,1.13519144)
fit <- fitdist(a, "gamma", lower = c(0, 0))
Someone does not like my simpler approach (the pgamma one further down), which is conditional on the MLE; now let's see something unconditional. If we take direct integration, we need a triple integral: one for shape, one for rate, and finally one for x. This is not appealing. I will just produce a Monte Carlo estimate instead.
By the asymptotic normality of maximum-likelihood estimators, the MLE is approximately normally distributed. fitdistrplus::fitdist does not give standard errors here, but we can use MASS::fitdistr, which returns the estimated covariance matrix of the MLE:
library(MASS)
fit <- fitdistr(a, "gamma", lower = c(0, 0))
b <- fit$estimate
#    shape     rate
# 1.739737 1.816134
V <- fit$vcov  ## covariance matrix of the MLE
#           shape      rate
# shape 0.1423679 0.1486193
# rate  0.1486193 0.2078086
Now we would like to sample from the parameter distribution and get samples of the target probability.
set.seed(0)
## sample from bivariate normal with mean `b` and covariance `V`
## Cholesky method is used here
X <- matrix(rnorm(1000 * 2), 1000) ## 1000 `N(0, 1)` normal samples
R <- chol(V) ## upper triangular Cholesky factor of `V`
X <- X %*% R ## transform X under desired covariance
X <- X + b ## shift to desired mean
## you can use `cov(X)` to check it is very close to `V`
## now samples for `Pr(1 < x < 1.5)`
p <- pgamma(1.5, X[,1], X[,2]) - pgamma(1, X[,1], X[,2])
We can make a histogram of p (and maybe do a density estimation if you want):
hist(p, prob = TRUE)
Now, we often want the sample mean as a point estimate:
mean(p)
# [1] 0.1906975
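The same samples also give an uncertainty interval for the target probability (my addition, using the usual percentile method):
quantile(p, c(0.025, 0.975))  # 95% interval for Pr(1 < x <= 1.5)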
Here is an example that uses MCMC techniques and a Bayesian mode of inference to estimate the posterior probability that a new observation falls in the interval (1, 1.5]. This is an unconditional estimate, as opposed to the conditional estimate obtained by integrating the gamma distribution with maximum-likelihood parameter estimates.
This code requires that JAGS be installed on your computer (free and easy to install).
library(rjags)
a <- c(2.44121289,1.70292449,0.30550832,0.04332383,1.0553436,0.26912546,0.43590885,0.84514809,
0.36762336,0.94935435,1.30887437,1.08761895,0.66581035,0.83108270,1.7567334,1.00241339,
0.96263021,1.67488277,0.87400413,0.34639636,1.16804671,1.4182144,1.7378907,1.7462686,
1.7427784,0.8377457,0.1428738,0.71473956,0.8458882,0.2140742,0.9663167,0.7933085,
0.0475603,1.8657773,0.18307362,1.13519144)
# Specify the model in JAGS language using diffuse priors for shape and rate
sink("GammaModel.txt")
cat("model{
  # Priors
  shape ~ dgamma(.001, .001)
  rate ~ dgamma(.001, .001)
  # Model structure
  for(i in 1:n){
    a[i] ~ dgamma(shape, rate)
  }
}
", fill = TRUE)
sink()
jags.data <- list(a=a, n=length(a))
# Give overdispersed initial values (not important for this simple model, but very
# important for complicated models where you need to check convergence by
# monitoring multiple chains)
inits <- function(){list(shape=runif(1,0,10), rate=runif(1,0,10))}
# Specify which parameters to monitor
params <- c("shape", "rate")
# Set-up for MCMC run
nc <- 1 # number of chains
n.adapt <- 1000 # number of adaptation steps
n.burn <- 1000 # number of burn-in steps
n.iter <- 500000 # number of posterior samples
thin <- 10 # thinning of posterior samples
# Running the model
gamma_mod <- jags.model('GammaModel.txt', data = jags.data, inits=inits, n.chains=nc, n.adapt=n.adapt)
update(gamma_mod, n.burn)
gamma_samples <- coda.samples(gamma_mod,params,n.iter=n.iter, thin=thin)
# Summarize the result
summary(gamma_samples)
# Draw posterior predictive samples for a new observation x
x <- rep(NA, 50000)
for(i in 1:50000){
  x[i] <- rgamma(1, gamma_samples[[1]][i, 1], rate = gamma_samples[[1]][i, 2])
}
# Find the proportion of x values that fall in the desired range
length(which(x > 1 & x < 1.5))/length(x)
Answer:
Pr(1 < x <= 1.5) = 0.194
So pretty close to the conditional estimate, but this is not guaranteed to generally be the case.
You can just use pgamma with estimated parameters in fit.
b <- fit$estimate
#    shape     rate
# 1.739679 1.815995
pgamma(1.5, b[1], b[2]) - pgamma(1, b[1], b[2])
# [1] 0.1896032
Thanks. But how about P(x > 2)?
Check out the lower.tail argument:
pgamma(q, shape, rate = 1, scale = 1/rate, lower.tail = TRUE, log.p = FALSE)
By default, pgamma(q) evaluates Pr(x <= q). Setting lower.tail = FALSE gives Pr(x > q). So you can do:
pgamma(2, b[1], b[2], lower.tail = FALSE)
# [1] 0.08935687
Or you can also use
1 - pgamma(2, b[1], b[2])
# [1] 0.08935687

Random effects variance of intercept being zero

I am running a power analysis using a normal LMM in R. I have seven input parameters, two of which I do not need to test (the number of years and the number of sites). The other five parameters are the intercept, the slope, and the random-effects standard deviations of the residual, the intercept and the slope.
Given that my response data (year is the sole explanatory variable in the model) are bounded between -1 and +1, the intercept also falls in this range. What I am finding is that if I run, say, 1000 simulations with a given intercept and slope (which I treat as constant over 10 years), then when the random-effects intercept SD falls below a certain value, many simulations return an estimated intercept SD of zero. If I inflate the intercept SD, the parameter is estimated correctly (see below, where I use residual SD = 0.25, intercept SD = 0.10 and slope SD = 0.05; if I increase the intercept SD to 0.2 it is estimated correctly, and likewise if I drop the residual SD to, say, 0.05 the variance parameters are estimated correctly).
Is this problem due to my coercion of the range to (-1, +1)?
I include the code for my function and the processing of the simulations below, if this would help:
Function: generating the data:
normaldata <- function(J, K, beta0, beta1, sigma_resid,
                       sigma_beta0, sigma_beta1){
  year <- rep(rep(0:J), K)        # 0:J replicated K times
  site <- rep(1:K, each = (J+1))  # 1:K sites, repeated J years
  mu.beta0_true <- beta0
  mu.beta1_true <- beta1
  # random effects variance parameters:
  sigma_resid_true <- sigma_resid
  sigma_beta0_true <- sigma_beta0
  sigma_beta1_true <- sigma_beta1
  # site-level parameters:
  beta0_true <<- rnorm(K, mu.beta0_true, sigma_beta0_true)
  beta1_true <<- rnorm(K, mu.beta1_true, sigma_beta1_true)
  # data
  y <<- rnorm(n = (J+1)*K, mean = beta0_true[site] + beta1_true[site]*year,
              sd = sigma_resid_true)
  # NOT SURE WHETHER TO IMPOSE THE LIMITS HERE OR LATER IN CODE:
  y[y < -1] <- -1  # absolute minimum
  y[y > 1] <- 1    # absolute maximum
  return(data.frame(y, year, site))
}
Processing the simulations:
library(lme4)
library(arm)  # for se.fixef()
n.sims <- 1000
sigma.resid <- rep(0, n.sims)
sigma.intercept <- rep(0, n.sims)
sigma.slope <- rep(0, n.sims)
intercept <- rep(0, n.sims)
slope <- rep(0, n.sims)
signif <- rep(0, n.sims)
for (s in 1:n.sims){
  y.data <- normaldata(10, 200, 0.30, ((0-0.30)/10), 0.25, 0.1, 0.05)
  lme.power <- lmer(y ~ year + (1 + year | site), data = y.data)
  theta.hat <- fixef(lme.power)[["year"]]
  theta.se <- se.fixef(lme.power)[["year"]]
  signif[s] <- ((theta.hat + 1.96*theta.se) < 0) |
    ((theta.hat - 1.96*theta.se) > 0)  # returns TRUE or FALSE
  betas <- fixef(lme.power)
  intercept[s] <- betas[1]
  slope[s] <- betas[2]
  vc1 <- as.data.frame(VarCorr(lme.power))
  vc2 <- as.numeric(attributes(VarCorr(lme.power)$site)$stddev)
  sigma.resid[s] <- vc1[4, 5]
  sigma.intercept[s] <- vc2[1]
  sigma.slope[s] <- vc2[2]
  cat(paste(s, " ")); flush.console()
}
power <- mean(signif)  # proportion of TRUE
power
summary(sigma.resid)
summary(sigma.intercept)
summary(sigma.slope)
summary(intercept)
summary(slope)
Thank you in advance for any help you can offer.
This is really more of a statistical than a computational question, but the short answer is: you haven't made any mistakes; this is exactly as expected. This example on RPubs runs some simulations of a normally distributed response (i.e., it corresponds exactly to the model assumed by LMM software, so the constraint you're worried about isn't an issue).
The lefthand histogram below is from simulations with 25 samples in 5 groups, equal variance (of 1) within and between groups; the righthand histogram is from simulations with 15 samples in 3 groups.
The sampling distribution of variance estimates in null cases (i.e., no real between-group variation) is known to have a point mass or "spike" at zero; it's not surprising (although as far as I know not theoretically worked out) that the sampling distribution should also have a point mass at zero when the between-group variance is non-zero but small, and/or when the sample is small and/or noisy.
http://bbolker.github.io/mixedmodels-misc/glmmFAQ.html#zero-variance has more on this topic.
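A minimal sketch of the phenomenon (my own, assuming lme4 is installed): simulate a small between-group SD relative to the residual SD, refit repeatedly, and count how often the estimated intercept SD collapses to zero.
library(lme4)
set.seed(1)
sd_hat <- replicate(200, {
  d <- data.frame(g = factor(rep(1:5, each = 5)))
  d$y <- rnorm(5, 0, 0.2)[as.integer(d$g)] + rnorm(25, 0, 1)  # between-group SD 0.2, residual SD 1
  m <- suppressMessages(lmer(y ~ 1 + (1 | g), data = d))
  as.numeric(attr(VarCorr(m)$g, "stddev"))
})
mean(sd_hat < 1e-4)  # proportion of fits with an (essentially) zero variance estimate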

Confidence Interval for Mu in a Lognormal Distribution in R

Suppose we have a random sample of size n = 8 from a lognormal distribution with parameters mu and sigma. Since it is a small sample from a non-normal population, I will be using the t confidence interval. I ran a simulation to determine the true (simulated) confidence level of a 90% t-CI with mu = 1 and sigma = 1.5.
My problem is that my code below assumes a NORMAL distribution, and it needs to be a lognormal distribution.
I know that rnorm has to become rlnorm so that the random variables come from the lognormal distribution. But I need to change what mu and sigma are: mu and sigma of the normal distribution aren't the same as the mean and standard deviation of the lognormal.
The mean of the lognormal distribution is exp(μ + σ^2/2), and its variance is (exp(σ^2) - 1) * exp(2μ + σ^2).
I'm just confused about how I can incorporate these two equations into my code.
BTW, if you didn't already guess, I am VERY new to R. Any help would be appreciated!
MC <- 10000  # number of samples to simulate
result <- rep(NA, MC)
mu <- 1
sigma <- 1.5
n <- 8       # sample size
alpha <- 0.1 # the nominal confidence level is 100(1-alpha) percent
t_criticalValue <- qt(p = (1 - alpha/2), df = (n - 1))
for(i in 1:MC){
  mySample <- rlnorm(n = n, meanlog = mu, sdlog = sigma)  # rlnorm's parameters are meanlog/sdlog
  lowerCL <- mean(mySample) - t_criticalValue*sd(mySample)/sqrt(n)
  upperCL <- mean(mySample) + t_criticalValue*sd(mySample)/sqrt(n)
  result[i] <- ((lowerCL < mu) & (mu < upperCL))
}
SimulatedConfidenceLevel <- mean(result)
EDIT: So I tried replacing mu and sigma with their respective formulas,
mean = exp(μ + σ^2/2)
variance = exp(2μ + σ^2) * (exp(σ^2) - 1)
and I got a SimulatedConfidenceLevel of 5000.
Here's some reproducible sample data:
(x <- rlnorm(8, 1, 1.5))
## [1] 3.5415832 0.3563604 0.5052436 3.5703968 7.3696985 0.7737094 12.9768734 35.9143985
Your definition of the critical value was correct:
n <- length(x)
alpha <- 0.1
t_critical_value <- qt(1 - alpha / 2, n - 1)
There's a utility function in the ggplot2 plotting package, mean_se(), that calculates means and standard errors. In this case, you can apply it to the log of your data to find mu and its confidence interval.
library(ggplot2)
mean_se(log(x), t_critical_value)
## y ymin ymax
## 1 1.088481 -0.006944755 2.183907
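And to close the loop on the original simulation, here is a sketch (my own) of the coverage check done on the log scale, where the t-interval really does target mu:
set.seed(42)
MC <- 10000; n <- 8; mu <- 1; sigma <- 1.5; alpha <- 0.1
tcrit <- qt(1 - alpha/2, df = n - 1)
covered <- replicate(MC, {
  s <- log(rlnorm(n, meanlog = mu, sdlog = sigma))  # back on the normal scale
  half <- tcrit * sd(s) / sqrt(n)
  (mean(s) - half < mu) && (mu < mean(s) + half)
})
mean(covered)  # should be close to 0.90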
