I get different results when I calculate a confidence interval (CI) in R in two ways:
x=c(25,30,15,45,22,54)
#IC 1
install.packages("Rmisc")
library(Rmisc)
CI(x,0.95) # [16.30429 ; 47.36238]
#IC2
lclm=mean(x)-(1.96*sd(x)/sqrt(length(x))) #19.99285
Uclm=mean(x)+(1.96*sd(x)/sqrt(length(x))) #43.67382
I want to know why I don't get the same intervals from the two approaches.
Thank you!
Your 1.96 is an approximation of the 97.5% quantile of the standard normal distribution, which the Student t-distribution approaches only as the sample size tends toward infinity. With your sample size of N = 6, there are considerable differences between the standard normal and the Student t-distribution, so the t-based interval returned by CI() is wider.
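As a quick illustration of that gap, the two quantiles can be compared directly. This is a minimal sketch; the vector df_values is just an illustrative choice of degrees of freedom.

```r
# 97.5% quantile of the standard normal: the source of the familiar 1.96
z <- qnorm(0.975)

# 97.5% quantile of Student's t with n - 1 = 5 degrees of freedom
t5 <- qt(0.975, df = 5)

# The t quantile shrinks toward the normal quantile as df grows
df_values <- c(5, 30, 100, 1000)   # illustrative choices
t_quantiles <- qt(0.975, df = df_values)

round(c(normal = z, t_df5 = t5), 4)
round(setNames(t_quantiles, paste0("df", df_values)), 4)
```

With 5 degrees of freedom the t quantile is about 2.57 rather than 1.96, which accounts for the wider interval.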
Here is the calculation of the desired quantile as per Stéphane's comment:
library(Rmisc)
x <- c(25, 30, 15, 45, 22, 54)
#IC 1
CI(x, 0.95)
#> upper mean lower
#> 47.36238 31.83333 16.30429
#IC2
m <- mean(x)
s <- sd(x)
n <- length(x)
q <- qt(1 - 0.05 / 2, n - 1)
c(
"upper" = m + q * s / sqrt(n),
"mean" = m,
"lower" = m - q * s / sqrt(n)
)
#> upper mean lower
#> 47.36238 31.83333 16.30429
Created on 2021-04-09 by the reprex package (v1.0.0)
In addition to what the-mad-statter and Stéphane said:
This is the function that the Rmisc package uses to calculate the CI:
function (x, ci = 0.95)
{
a <- mean(x)
s <- sd(x)
n <- length(x)
error <- qt(ci + (1 - ci)/2, df = n - 1) * s/sqrt(n)
return(c(upper = a + error, mean = a, lower = a - error))
}
You can find more detailed information here:
https://stats.stackexchange.com/questions/467015/what-is-the-confidence-interval-formula-that-rmisc-package-in-r-uses
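As a cross-check that does not need Rmisc, base R's t.test() reports the same t-based interval. This is a small sketch using the data from the question; the names ci_ttest and ci_manual are only for illustration.

```r
x <- c(25, 30, 15, 45, 22, 54)

# One-sample t-test; the conf.int component holds the t-based 95% interval
ci_ttest <- t.test(x, conf.level = 0.95)$conf.int

# Same interval computed by hand with the t quantile
n <- length(x)
half_width <- qt(0.975, df = n - 1) * sd(x) / sqrt(n)
ci_manual <- mean(x) + c(-1, 1) * half_width

ci_ttest
ci_manual
```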
I am implementing the Dirichlet Mixture Model using the EM algorithm in R, but am experiencing issues with the results. I generated two binomial distributions with fractions of (70%, 30%) and means of (0.05, 0.18), and trimmed 5% of the data set near 0. However, I am using a Beta distribution for clustering instead of a binomial distribution. Additionally, I am updating the mean and variance of the distributions rather than the alpha and beta parameters in order to impose constraints on the variance of each distribution.
I expected to obtain results similar to the ground truth settings, but instead I am getting pi values of (1, 0) and means of (0.09, 0.21). I am not sure if there are errors in my EM algorithm implementation or issues with parameter initialization.
I am including my R code for the data generation and DMM below. I would appreciate any help in identifying the cause of the problem and suggestions for how to resolve it.
library(dplyr)
library(data.table)
library(tidyverse)
set.seed(42)
#read count
cover <- 100
#Ground Truth Setting
subclone_f <- c(0.7, 0.3) # Ground Truth Setting - proportion
subclone_vaf <- c(0.05, 0.18) # Ground Truth Setting - mean
n_muts <- 45000
n_clone <-length(subclone_f)
#generating the virtual mutation notation: subclonal if 2, clonal if 1
mut_type <- sample.int(2, n_muts, prob = subclone_f, replace = TRUE)
mut_type
#generating binomial read counts (rbinom) for the given coverage
mut_reads <- rbinom(n_muts, cover, prob = subclone_vaf[mut_type]) %>% data.frame()
mut_reads
vaf <- (mut_reads/cover) %>% data.frame()
# Truncate the low count reads
n <- 0.95 * nrow(vaf) # cut-off setting
vaf_trim <- sapply(vaf, function(x) sort(x, decreasing = TRUE)[1:n])
colnames(vaf_trim) <- c("vaf")
hist(vaf_trim, breaks=seq(0,0.75,by=0.0001))
# Mixture Model
# Parameter Initialization (for 2 subclonality)
pi <- c(0.5, 0.5) # Mixture proportion weight: sums up to 1
alpha <- c(2,3)
beta <- c(20,5)
Mu[1] <- alpha[1] / (alpha[1] + beta[1])
Mu[2] <- alpha[2] / (alpha[2] + beta[2])
var[1] <- alpha[1]*beta[1] / ((alpha[1] + beta[1])^2 * (alpha[1] + beta[1] +1))
var[2] <- alpha[2]*beta[2] / ((alpha[2] + beta[2])^2 * (alpha[2] + beta[2] +1))
tau <-c(0.05, 0.05)
loglike[1] <- 0.5
loglike[2] <- 0.5
k <- 2
Nu <- 1/ (alpha + beta + 1) # controls the variance; I originally wanted both distributions to share the same Nu, but I don't know how to do that
n_cluster <- nrow(data.frame(pi))
logdbeta <- function(x, alpha, beta) {
sum(sapply(x, function(x) {dbeta(x, alpha, beta, log = TRUE)}))
}
estBetaParams <- function(mu, var) {
alpha <- ((1 - mu) / var - (1 / mu)) * mu ^ 2
beta <- alpha * (1 / mu - 1)
return(params = list(alpha = alpha, beta = beta))
}
# Loop for the EM algorithm
while(abs(loglike[k]-loglike[k-1]) >= 0.00001) {
# E step
total <- (pi[1]*dbeta(vaf_trim, alpha[1], beta[1])) + (pi[2]*dbeta(vaf_trim, alpha[2], beta[2]))
tau1 <- pi[1]*(dbeta(vaf_trim, alpha[1], beta[1]))/ total
tau2 <- pi[2]*(dbeta(vaf_trim, alpha[2], beta[2]))/ total
# M step
pi[1] <- sum(tau1)/length(vaf_trim) # Update Pi(weight)
pi[2] <- sum(tau2)/length(vaf_trim)
Mu[1] <- sum(tau1*vaf_trim)/sum(tau1) # Update Mu
Mu[2] <- sum(tau2*vaf_trim)/sum(tau2)
#Nu <- alpha + beta
Nu <- 1/ (alpha + beta + 1)
# Our main aim was to share the same coefficient for all dist
var[1] <- Mu[1] * (1-Mu[1]) * Nu[1] # Update Variance
var[2] <- Mu[2] * (1-Mu[2]) * Nu[2]
#Update in terms of alpha and beta
estBetaParams(Mu[1], var[1])
estBetaParams(Mu[2], var[2])
# Maximize the loglikelihood
loglike[k+1]<-sum(tau1*(log(pi[1])+logdbeta(vaf_trim,Mu[1],var[1])))+sum(tau2*(log(pi[2])+logdbeta(vaf_trim,Mu[2],var[2])))
k<-k+1
}
# Print estimates
EM <- data.table(param = c("pi", "Mean"), pi = pi, Mean = Mu)
knitr::kable(EM)
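One thing that stands out in the loop above is that the value returned by estBetaParams() is never assigned, so alpha and beta keep their initial values in every E step. The following is only a sketch of how the conversion from mean and variance back to alpha and beta could be captured; the example values mu and v are hypothetical, and this is not a full fix of the EM loop.

```r
# Method-of-moments conversion, same formulas as estBetaParams() in the question
estBetaParams <- function(mu, var) {
  alpha <- ((1 - mu) / var - (1 / mu)) * mu^2
  beta <- alpha * (1 / mu - 1)
  list(alpha = alpha, beta = beta)
}

# Hypothetical values for one component's mean and variance
mu <- 0.1
v <- 0.001

# Capture the result instead of discarding it
params <- estBetaParams(mu, v)
alpha_new <- params$alpha
beta_new <- params$beta

# Round trip: the Beta mean and variance implied by the new parameters
mean_back <- alpha_new / (alpha_new + beta_new)
var_back <- alpha_new * beta_new /
  ((alpha_new + beta_new)^2 * (alpha_new + beta_new + 1))

c(alpha = alpha_new, beta = beta_new, mean = mean_back, var = var_back)
```

Inside the loop, the analogous step would be to write the returned alpha and beta back into the vectors used by the E step.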
"Setting the seed at 747, generate m=500 samples of dimension n=880 from a population X, with Exponential distribution of expected value 1/λ=1/0.52, i.e. X∼Exp(λ=0.52).
For each of the generated samples, construct an approximate confidence interval for λ. Consider the confidence level 1−α=0.99. Indicate the mean of the amplitude of the m=500 confidence intervals obtained."
In this exercise I did:
m_ic <- function(seed, m, n, lambda, gama) {
set.seed(seed)
return(mean(replicate(m, (2*(qnorm((1+gama)/2)/sqrt(n)))/(mean(rexp(n ,lambda))))))
}
m_ic(seed=747, m=500, n=880, lambda=0.52, gama=0.99)
But what if n ∈ {100, 200, 300, …, 4000}, for example? How could I loop over these values and put the results in a data frame? And how could I then plot them with something like geom_line() or geom_point()?
You can use sapply() to provide a sequence of different values of n to your m_ic() function; save these in a vector and plot, like this:
library(ggplot2)
n_vals = seq(100,4000,100)
m_ic_values = sapply(n_vals, \(n) m_ic(seed=747,m=500,n=n, lambda=0.52, gama=0.99))
ggplot(NULL, aes(x=n_vals, y=m_ic_values)) +
geom_point() +
geom_line()
Your function m_ic computes normal-approximation confidence intervals, but the exponential distribution is far from normal, and intervals based on the Gamma distribution are a better choice. The function gamma_ic below computes these intervals, and the code that computes their amplitude is then repeated in a sapply loop.
First, the mean amplitude of the intervals with n = 880.
gamma_ic <- function(x, conf = 0.95){
n <- length(x)
qlo <- (1 - conf)/2
qhi <- 1 - (1 - conf)/2
qq <- qgamma(c(qlo, qhi), n, n)/mean(x)
c(lower = qq[1], upper = qq[2])
}
n <- 880
m <- 500
lambda <- 0.52
set.seed(747)
x <- replicate(m, rexp(n, rate = lambda))
ci <- apply(x, 2, gamma_ic, conf = 0.99)
mean(apply(ci, 2, diff))
#> [1] 0.09059922
Created on 2022-06-12 by the reprex package (v2.0.1)
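The interval in gamma_ic rests on the fact that, for an exponential sample, λ times the sample mean follows a Gamma distribution with shape n and rate n. A rough way to sanity-check this is to simulate the coverage. The sketch below assumes a smaller sample size and replication count (n_small, reps), chosen only to keep it fast.

```r
gamma_ic <- function(x, conf = 0.95) {
  n <- length(x)
  qlo <- (1 - conf) / 2
  qhi <- 1 - (1 - conf) / 2
  qq <- qgamma(c(qlo, qhi), shape = n, rate = n) / mean(x)
  c(lower = qq[1], upper = qq[2])
}

lambda <- 0.52
n_small <- 50    # assumed sample size for the check
reps <- 2000     # assumed number of replications

set.seed(747)
covered <- replicate(reps, {
  ci <- gamma_ic(rexp(n_small, rate = lambda), conf = 0.99)
  ci[["lower"]] <= lambda && lambda <= ci[["upper"]]
})

# Proportion of intervals that contain the true rate; should be near 0.99
coverage <- mean(covered)
coverage
```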
Now the amplitudes for n from 100 to 4000 with increments of 100.
n_vec <- seq(100L, 4000L, by = 100L)
ampl <- sapply(n_vec, \(n) {
set.seed(747)
x <- replicate(m, rexp(n, rate = lambda))
ci <- apply(x, 2, gamma_ic, conf = 0.99)
mean(apply(ci, 2, diff))
})
ampldata <- data.frame(n = n_vec, amplitude = ampl)
library(ggplot2)
ggplot(ampldata, aes(n, amplitude)) +
geom_line() +
geom_point() +
theme_bw()
Suppose I am seeking to integrate the following function from 0 to 10:
How would I accomplish this in R?
Functions
# Functional form
fn <- function(t) -100*(t)^2 + 20000
# First derivative w.r.t. t
fn_dt <- function(t) -200*t
# Density function phi
phi <- approxfun(density(rnorm(35, 15, 7)))
# Delta t
delta <- 5
How about the following:
First off, we choose a fixed seed for reproducibility.
# Density function phi
set.seed(2017);
phi <- approxfun(density(rnorm(35, 15, 7)))
We define the integrand.
integrand <- function(x) {
f1 <- -500 * x^2 + 100000;
f2 <- phi(x);
f2[is.na(f2)] <- 0;
return(f1 * f2)
}
By default, approxfun returns NA if x falls outside the interval [min(x), max(x)]; since phi is based on the density of a normal distribution, we can replace NAs with 0.
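An alternative to patching the NAs afterwards, sketched here under the same assumption that the density is negligible outside the estimated range, is to ask approxfun for zeros directly through its yleft and yright arguments. The names dens, phi0, far_out, and inside are introduced only for this illustration.

```r
set.seed(2017)
dens <- density(rnorm(35, 15, 7))

# Default behaviour (rule = 1): NA outside the range of dens$x
phi <- approxfun(dens)

# Return 0 to the left and right of the estimated range instead
phi0 <- approxfun(dens, yleft = 0, yright = 0)

far_out <- 1000   # a point well outside the estimated range
phi(far_out)      # NA
phi0(far_out)     # 0

# Inside the range both versions agree
inside <- 15
all.equal(phi(inside), phi0(inside))
```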
Let's plot the integrand:
library(ggplot2);
ggplot(data.frame(x = 0), aes(x)) + stat_function(fun = integrand) + xlim(-50, 50);
We use integrate to calculate the integral; here I assume you are interested in the interval [-Inf, +Inf].
integrate(integrand, lower = -Inf, upper = Inf)
#-39323.06 with absolute error < 4.6
I was wondering how I could check via simulation in R that the 95% Confidence Interval obtained from a binomial test with 5 successes in 15 trials when TRUE p = .5 has a 95% "Coverage Probability" in the long-run?
Here is the 95% CI for such a test in R (how can I show that the following CI has 95% coverage in the long run if the TRUE p = .5?):
as.numeric(binom.test(x = 5, n = 15, p = .5)[[4]])
# > [1] 0.1182411 0.6161963 (in the long-run 95% of the time, ".5" is contained within these
# two numbers, how to show this in R?)
Something like this?
fun <- function(n = 15, p = 0.5){
x <- rbinom(1, size = n, prob = p)
res <- binom.test(x, n, p)[[4]]
c(Lower = res[1], Upper = res[2])
}
set.seed(3183)
R <- 10000
sim <- t(replicate(R, fun()))
Note that binom.test when called with 5 successes, 15 trials and p = 0.5 will always return the same value, hence the call to rbinom. The number of successes will vary. We can compute the proportion of cases when p is between Lower and Upper.
cov <- mean(sim[,1] <= .5 & .5 <= sim[,2])
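Because the number of successes can only take the values 0 to n, the long-run coverage can also be computed exactly rather than simulated. The sketch below relies on the Clopper-Pearson interval that binom.test reports and weights each possible outcome by its binomial probability; the helper name exact_coverage is hypothetical.

```r
exact_coverage <- function(n = 15, p = 0.5, conf.level = 0.95) {
  x <- 0:n
  # Does the interval for each possible number of successes contain p?
  contains_p <- sapply(x, function(k) {
    ci <- binom.test(k, n, conf.level = conf.level)$conf.int
    ci[1] <= p && p <= ci[2]
  })
  # Weight each outcome by its probability under the true p
  sum(dbinom(x, size = n, prob = p)[contains_p])
}

coverage_exact <- exact_coverage()
coverage_exact
```

The result should be close to the simulated proportion and should not fall below 0.95, since the Clopper-Pearson interval is conservative.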
Given a Laplace Distribution proposal:
g(x) = 1/2*e^(-|x|)
and sample size n = 1000, I want to conduct Monte Carlo (MC) integration to estimate θ:
via importance sampling. Eventually I want to calculate the mean and standard deviation of this MC estimate in R once I get there.
Edit (arrived late after the answer below)
This is what I have for my R code so far:
library(VGAM)
n = 1000
x = rexp(n,0.5)
hx = mean(2*exp(-sqrt(x))*(sin(x))^2)
gx = rlaplace(n, location = 0, scale = 1)
Now we can write a simple R function to sample from Laplace distribution:
## `n` is sample size
rlaplace <- function (n) {
u <- runif(n, 0, 1)
ifelse(u < 0.5, log(2 * u), -log(2* (1 - u)))
}
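A quick sanity check of this inverse-CDF sampler is to compare sample moments with the theoretical values for the standard Laplace distribution, which has mean 0 and variance 2. This is only a sketch; the sample size n_check is an arbitrary choice.

```r
rlaplace <- function(n) {
  u <- runif(n, 0, 1)
  ifelse(u < 0.5, log(2 * u), -log(2 * (1 - u)))
}

set.seed(0)
n_check <- 1e5            # arbitrary, just large enough for a stable check
draws <- rlaplace(n_check)

sample_mean <- mean(draws)   # theoretical value: 0
sample_var <- var(draws)     # theoretical value: 2
c(mean = sample_mean, var = sample_var)
```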
Also write a function for density of Laplace distribution:
g <- function (x) ifelse(x < 0, 0.5 * exp(x), 0.5 * exp(-x))
Now, your integrand is:
f <- function (x) {
ifelse(x > 0, exp(-sqrt(x) - 0.5 * x) * sin(x) ^ 2, 0)
}
Now we estimate the integral using 1000 samples (set.seed for reproducibility):
set.seed(0)
x <- rlaplace(1000)
mean(f(x) / g(x))
# [1] 0.2648853
Also compare with numerical integration using quadrature:
integrate(f, lower = 0, upper = Inf)
# 0.2617744 with absolute error < 1.6e-05
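The question also asks for the mean and standard deviation of the MC estimate. One common way to get these is to treat the importance weights as an i.i.d. sample. The sketch below redefines the sampler, density, and integrand so that it runs on its own (pmax is used only to avoid NaN warnings from sqrt); the names w, theta_hat, and se_hat are illustrative.

```r
rlaplace <- function(n) {
  u <- runif(n, 0, 1)
  ifelse(u < 0.5, log(2 * u), -log(2 * (1 - u)))
}
g <- function(x) 0.5 * exp(-abs(x))
f <- function(x) {
  xp <- pmax(x, 0)   # avoids sqrt() of negative numbers
  ifelse(x > 0, exp(-sqrt(xp) - 0.5 * xp) * sin(xp)^2, 0)
}

set.seed(0)
n <- 1000
x <- rlaplace(n)
w <- f(x) / g(x)             # importance weights

theta_hat <- mean(w)          # MC estimate of the integral
se_hat <- sd(w) / sqrt(n)     # estimated standard deviation of theta_hat
c(estimate = theta_hat, std_error = se_hat)
```

Repeating the whole procedure many times and taking the standard deviation of the resulting estimates would be another way to approximate the same quantity.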