I am trying to use bridge sampling to sample a gamma process on the interval (0, T) with a partition of length 2^{m}, where T = T_{2^m} and m is the number of bridge levels. I am using RStudio, but my results are all NAs. Can someone please help me? I have attached a picture of the algorithm.
set.seed(1)
M <- 7
nt <- 2^M + 1 #number of observations
T <- nt - 1 #total time
T_ <- seq(0, T, length.out=nt) #fixed time increments
G_[T] <- rgamma(1, shape=T/nu, scale=nu)
G_[1] <- 0
for (m in 1:M){
for (j in 2:2^(m-1)){
i <- (2*j-1)
Y <- rbeta(1,T/(nu*2^m), T/(nu*2^m))
G_[i*T/2^m] <- G_[(i-1)^T/2^m] + (G_[(i+1)*T/2^m] - G_[(i-1)*T/2^m])*Y
}
}
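A minimal sketch of gamma bridge sampling on a dyadic grid, for comparison with the loop above. The function name `gamma_bridge`, the argument `T_total`, and the value of `nu` are assumptions made for illustration (the snippet above does not define `nu`). The sketch allocates the whole vector before filling it, shifts every index by one because R vectors start at 1, and draws a fresh beta variate for each midpoint.

```r
# Gamma bridge sampling on a grid of 2^M equal time steps (illustrative sketch).
gamma_bridge <- function(M, T_total, nu) {
  N <- 2^M
  G <- rep(NA_real_, N + 1)          # G[k + 1] is the process at time k * T_total / N
  G[1]     <- 0                      # process starts at 0
  G[N + 1] <- rgamma(1, shape = T_total / nu, scale = nu)  # terminal value
  for (m in seq_len(M)) {
    step <- N / 2^m                  # index distance from a midpoint to its endpoints
    dt   <- T_total / 2^m            # time distance from a midpoint to its endpoints
    for (j in seq_len(2^(m - 1))) {
      mid <- (2 * j - 1) * step      # zero-based grid index of the midpoint
      Y   <- rbeta(1, dt / nu, dt / nu)
      left  <- G[mid - step + 1]
      right <- G[mid + step + 1]
      G[mid + 1] <- left + (right - left) * Y
    }
  }
  G
}

set.seed(1)
G <- gamma_bridge(M = 7, T_total = 128, nu = 0.1184)  # nu is an assumed value
```

Because each midpoint is a convex combination of its two already-sampled neighbours, the resulting path is non-decreasing, which is a quick sanity check for a gamma process.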
I am implementing the Dirichlet Mixture Model using the EM algorithm in R, but am experiencing issues with the results. I generated two binomial distributions with fractions of (70%, 30%) and means of (0.05, 0.18), and trimmed 5% of the data set near 0. However, I am using a Beta distribution for clustering instead of a binomial distribution. Additionally, I am updating the mean and variance of the distributions rather than the alpha and beta parameters in order to impose constraints on the variance of each distribution.
I expected to obtain results similar to the ground truth settings, but instead I am getting pi values of (1, 0) and means of (0.09, 0.21). I am not sure if there are errors in my EM algorithm implementation or issues with parameter initialization.
I am including my R code for the data generation and DMM below. I would appreciate any help in identifying the cause of the problem and suggestions for how to resolve it.
library(dplyr)
library(data.table)
library(tidyverse)
set.seed(42)
#read count
cover <- 100
#Ground Truth Setting
subclone_f <- c(0.7, 0.3) # Ground Truth Setting - proportion
subclone_vaf <- c(0.05, 0.18) # Ground Truth Setting - mean
n_muts <- 45000
n_clone <-length(subclone_f)
#generating the virtual mutation notation: subclonal if 2, clonal if 1
mut_type <- sample.int(2, n_muts, prob = subclone_f, replace = TRUE)
mut_type
#generating binomially distributed read counts for the given coverage
mut_reads <- rbinom(n_muts, cover, prob = subclone_vaf[mut_type]) %>% data.frame()
mut_reads
vaf <- (mut_reads/cover) %>% data.frame()
# Truncate the low count reads
n <- 0.95 * nrow(vaf) # cut-off setting
vaf_trim <- sapply(vaf, function(x) sort(x, decreasing = TRUE)[1:n])
colnames(vaf_trim) <- c("vaf")
hist(vaf_trim, breaks=seq(0,0.75,by=0.0001))
# Mixture Model
# Parameter Initialization (for 2 subclonality)
pi <- c(0.5, 0.5) # Mixture proportion weight: sums up to 1
alpha <- c(2,3)
beta <- c(20,5)
Mu[1] <- alpha[1] / (alpha[1] + beta[1])
Mu[2] <- alpha[2] / (alpha[2] + beta[2])
var[1] <- alpha[1]*beta[1] / ((alpha[1] + beta[1])^2 * (alpha[1] + beta[1] +1))
var[2] <- alpha[2]*beta[2] / ((alpha[2] + beta[2])^2 * (alpha[2] + beta[2] +1))
tau <-c(0.05, 0.05)
loglike[1] <- 0.5
loglike[2] <- 0.5
k <- 2
Nu <- 1/ (alpha + beta + 1) # controls the variance; originally I wanted to use the same Nu for both distributions, but I don't know how to do that
n_cluster <- nrow(data.frame(pi))
logdbeta <- function(x, alpha, beta) {
sum(sapply(x, function(x) {dbeta(x, alpha, beta, log = TRUE)}))
}
estBetaParams <- function(mu, var) {
alpha <- ((1 - mu) / var - (1 / mu)) * mu ^ 2
beta <- alpha * (1 / mu - 1)
return(params = list(alpha = alpha, beta = beta))
}
# Loop for the EM algorithm
while(abs(loglike[k]-loglike[k-1]) >= 0.00001) {
# E step
total <- (pi[1]*dbeta(vaf_trim, alpha[1], beta[1])) + (pi[2]*dbeta(vaf_trim, alpha[2], beta[2]))
tau1 <- pi[1]*(dbeta(vaf_trim, alpha[1], beta[1]))/ total
tau2 <- pi[2]*(dbeta(vaf_trim, alpha[2], beta[2]))/ total
# M step
pi[1] <- sum(tau1)/length(vaf_trim) # Update Pi(weight)
pi[2] <- sum(tau2)/length(vaf_trim)
Mu[1] <- sum(tau1*vaf_trim)/sum(tau1) # Update Mu
Mu[2] <- sum(tau2*vaf_trim)/sum(tau2)
#Nu <- alpha + beta
Nu <- 1/ (alpha + beta + 1)
# Our main aim was to share the same coefficient for all dist
var[1] <- Mu[1] * (1-Mu[1]) * Nu[1] # Update Variance
var[2] <- Mu[2] * (1-Mu[2]) * Nu[2]
#Update in terms of alpha and beta
estBetaParams(Mu[1], var[1])
estBetaParams(Mu[2], var[2])
# Maximize the loglikelihood
loglike[k+1]<-sum(tau1*(log(pi[1])+logdbeta(vaf_trim,Mu[1],var[1])))+sum(tau2*(log(pi[2])+logdbeta(vaf_trim,Mu[2],var[2])))
k<-k+1
}
# Print estimates
EM <- data.table(param = c("pi", "Mean"), pi = pi, Mean = Mu)
knitr::kable(EM)
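One way to structure the loop described above is sketched here, under stated assumptions. The data are simulated directly from a two-component beta mixture as a stand-in for the trimmed VAFs. The M-step updates each component's mean and variance from the responsibilities and converts them to alpha and beta by moment matching, which is an approximation rather than an exact maximum-likelihood update. The names `beta_params` and `em_beta_mix`, the sample size, and the starting values are hypothetical. The key differences from the loop above are that the converted alpha and beta are actually used in the next E-step, and that the log-likelihood is evaluated with alpha and beta rather than with the mean and variance.

```r
set.seed(42)
# Simulated stand-in data: beta mixture with weights (0.7, 0.3) and means (0.05, 0.18)
n <- 5000
z <- sample.int(2, n, replace = TRUE, prob = c(0.7, 0.3))
x <- rbeta(n, c(5, 18)[z], c(95, 82)[z])

# Moment matching: convert (mean, variance) to beta shape parameters
beta_params <- function(mu, v) {
  s <- mu * (1 - mu) / v - 1
  list(alpha = mu * s, beta = (1 - mu) * s)
}

em_beta_mix <- function(x, mu, v, w = c(0.5, 0.5), tol = 1e-8, max_iter = 500) {
  ll_old <- -Inf
  for (iter in seq_len(max_iter)) {
    p <- beta_params(mu, v)
    # E step: weighted component densities, one column per component
    dens <- cbind(w[1] * dbeta(x, p$alpha[1], p$beta[1]),
                  w[2] * dbeta(x, p$alpha[2], p$beta[2]))
    ll  <- sum(log(rowSums(dens)))   # observed-data log-likelihood
    tau <- dens / rowSums(dens)      # responsibilities
    # M step: mixing weights, then weighted mean and variance per component
    w  <- colMeans(tau)
    mu <- colSums(tau * x) / colSums(tau)
    v  <- colSums(tau * outer(x, mu, "-")^2) / colSums(tau)
    if (abs(ll - ll_old) < tol) break
    ll_old <- ll
  }
  list(pi = w, mu = mu, var = v, loglik = ll, iter = iter)
}

fit <- em_beta_mix(x, mu = c(0.03, 0.25), v = c(0.001, 0.001))
fit$pi
fit$mu
```

A shared variance coefficient could be imposed in the M-step by replacing the per-component `v` with `mu * (1 - mu) * Nu` for a single common `Nu`; that variant is not shown here.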
I need tips on how to calculate the power function of the chi-square goodness-of-fit test using Monte Carlo simulation in R. I am familiar with the pwr.chisq.test function, but I need a way to write the code for the Monte Carlo simulation myself in R.
I can do it for the T-test as follows:
n <- 100
mean_true <- 17
sd_true <- 2
## Null-Hypothesis (H0: mean_true = mean_0):
mean_0 <- seq(16, 18, len=15)
alpha <- 0.05
B <- 1000
Empirical_Power <- rep(NA, length(mean_0))
for(j in 1:length(Empirical_Power)){
Test_Decisions <- rep(NA, B)
for(i in 1:B){
dat_X <- rnorm(n=n, mean=mean_true, sd = sd_true)
t.Test_result <- t.test(x = dat_X, alternative = "two.sided", mu = mean_0[j])
Test_Decisions[i] <- t.Test_result$p.value < alpha
}
Number_of_Rejections <- length(Test_Decisions[Test_Decisions==TRUE])
Empirical_Power[j] <- Number_of_Rejections/B
}
I need a similar approach for the chi-square test, but it doesn't seem to work. This is how far I got, but it is clearly wrong because the results make no sense:
n <- 100
Frequency_true <- c(50,60,40,47,53)
sd_true <- 2
Frequency_0 <- c(0.2,0.2,0.2,0.2,0.2)
alpha <- 0.05
B <- 1000
Empirical_Power <- rep(NA, length(Frequency_0))
for(j in 1:length(Empirical_Power)){
Test_Decisions <- rep(NA, B)
for(i in 1:B){
dat_X <- rchisq(100000, df=99)
Chisq_result <- chisq.test(x = Frequency_true, p= Frequency_0)
Test_Decisions[i] <- Chisq_result$p.value < alpha
}
Number_of_Rejections <- length(Test_Decisions[Test_Decisions==TRUE])
Empirical_Power[j] <- Number_of_Rejections/B
}
The experiment is to draw 250 balls from an urn with 5 different types of balls, where all types are equally likely to be drawn. The counts of each type drawn are given in the vector Frequency_true, which follows a multinomial distribution.
So, to obtain the simulated power of the test, simulate B draws with a fixed total count of 250 using rmultinom, run chi-squared goodness-of-fit tests, and compute the proportion of p-values below the significance level alpha.
sim_p_value <- function(B, freq, prob){
Sum <- sum(freq)
x <- rmultinom(B, size = Sum, prob = prob)
apply(x, 2, \(y) chisq.test(y, p = prob)$p.value)
}
Frequency_true <- c(50,60,40,47,53)
Frequency_0 <- c(0.2,0.2,0.2,0.2,0.2)
alpha <- 0.05
B <- 1000
set.seed(2022)
pval <- sim_p_value(B, Frequency_true, Frequency_0)
Empirical_Power <- mean(pval < alpha)
Empirical_Power
#> [1] 0.16
Created on 2022-07-09 by the reprex package (v2.0.1)
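As a complementary sketch: when the samples are drawn from the null probabilities themselves, the rejection rate estimates the size of the test; to estimate power against a particular alternative, the samples can be drawn from that alternative and still tested against the null. The function name `sim_power` and the alternative vector `p_alt` below are hypothetical choices for illustration.

```r
# Empirical rejection rate of the chi-square goodness-of-fit test.
# Samples are drawn from p_true and tested against the null vector p_null.
sim_power <- function(B, n, p_true, p_null, alpha = 0.05) {
  x <- rmultinom(B, size = n, prob = p_true)   # one column per simulated sample
  pval <- apply(x, 2, function(y) chisq.test(y, p = p_null)$p.value)
  mean(pval < alpha)
}

set.seed(2022)
p_null <- rep(0.2, 5)
p_alt  <- c(0.30, 0.25, 0.20, 0.15, 0.10)      # hypothetical alternative

size_hat  <- sim_power(1000, 250, p_null, p_null)  # rejection rate when H0 is true
power_hat <- sim_power(1000, 250, p_alt,  p_null)  # rejection rate under p_alt
```

Repeating the second call over a grid of alternatives would trace out an empirical power function, in the same spirit as the t-test loop over `mean_0`.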
I am trying to use bridge sampling in RStudio to simulate paths for the variance gamma process. My code is:
sigma = 0.5054
theta = 0.2464
nu = 0.1184
mu=1
N=2^(k)
k=5
V_<-rep(NA,252)
V_[0]<-0
G_[N]<-rgamma(1, shape=N*1/nu, scale=nu)
G_<-0
V<-rnorm(theta*G[N],sigma^2*G[N])
for(l in 1:k){
n<-2^(k-l)
for(j in 1:2^i-1){
i<-(2*j-1)*n
d1<-(n)*mu^2/nu
d2<-(n)*mu^2/nu
Y<-rbeta(1,d1,d2)
G_[i]<-G_[i-1]+(G[i+n]-G[i-n])*Y
G[i]
print(G_[i])
Z<-rnorm(0,(G_[i+n]-G_[i])*sigma^2*Y)
V_[i]<-Y*V_[i+n]+(1-Y)*V_[i-n]+Z
print(V_[i])
}
}
ts.plot(V[i])
I'm not sure what I've done wrong. The algorithm I am trying to follow is as below in the picture:
Based on your code, I simulated a numerical sequence, which can be roughly validated by using VarianceGamma::vgFit to estimate the parameters.
Note that the time index starts from 1 because R vectors are 1-indexed. The square root of the variance is used for the standard deviation in rnorm. I probably shouldn't add the change due to the interest rate, vgC, at the end, since it is not included in your algorithm; please set it to 0 if it doesn't make sense.
Simulation by Brownian bridge:
# Brownian-Gamma Bridge Sampling (BGBS) of a VG process
set.seed(1)
M <- 10
nt <- 2^M + 1 #number of observations
T <- nt - 1 #total time
T_ <- seq(0, T, length.out=nt) #fixed time increments
#random time increments
#T_ = c(0, runif(nt-2), 1)
#T_ = sort(T_) * T
r <- 1 + 0.2 #interest rate
vgC <- (r-1)
sigma <- 0.5054
theta <- 0.2464
nu <- 0.1184
V_ <- G_ <- rep(NA,nt)
V_[1] <- 0
G_[1] <- 0
G_[nt] <- rgamma(1, shape=T/nu, scale=nu)
V_[nt] <- rnorm(1, theta*G_[nt], sqrt(sigma^2*G_[nt]))
for (k in 1:M)
{
n <- 2^(M-k)
for (j in 1:2^(k-1))
{
i <- (2*j-1) * n
Y <- rbeta(1, (T_[i+1]-T_[i-n+1])/nu, (T_[i+n+1]-T_[i+1])/nu)
G_[i+1] <- G_[i-n+1] + (G_[i+n+1] - G_[i-n+1]) * Y
Z <- rnorm(1, sd=sqrt((G_[i+n+1] - G_[i+1]) * sigma^2 * Y))
V_[i+1] <- Y * V_[i+n+1] + (1-Y) * V_[i-n+1] + Z
}
}
V_ <- V_ + vgC*T_ # changes due to interest rate
plot(T_, V_)
The results roughly match with the estimation:
#Estimated parameters:
library(VarianceGamma)
dV <- V_[2:nt] - V_[1:(nt-1)]
vgFit(dV)
> vgC sigma theta nu
> 0.2996 0.5241 0.1663 0.1184
#Real parameters:
c(vgC, sigma, theta, nu)
> vgC sigma theta nu
> 0.2000 0.5054 0.2464 0.1184
EDIT
As you commented, there is another similar algorithm, and it can be implemented in a similar way.
Your code could be modified as below:
set.seed(1)
M <- 7
nt <- 2^M + 1
T <- nt - 1
T_ <- seq(0, T, length.out=nt)
sigma=0.008835
theta= -0.003856
nu=0.263743
vgc=0.004132
V_ <- G_ <- rep(1,nt)
G_[T+1] <- rgamma(1, shape=T/nu, scale=nu) #
V_[T+1] <- rnorm(1, theta*G_[T+1], sqrt(sigma^2*G_[T+1])) #
V_[1] <- 0
G_[1] <- 0
for (m in 1:M){ #
for (j in 1:2^(m-1)){ #
i <- (2*j-1)
Y <- rbeta(1,T/(2^m*nu), T/(2^m*nu)) # draw a fresh beta variate for every midpoint
G_[i*T/(2^m)+1] = G_[(i-1)*T/(2^m)+1]+(-G_[(i-1)*T/(2^m)+1]+G_[(i+1)*T/(2^m)+1])*Y #
b=G_[T*(i+1)/2^m+1] - G_[T*(i)/2^m+1] #
Z_i <- rnorm(1, sd=sqrt(b*sigma^2*Y)) # rnorm expects a standard deviation, so take the square root of the variance
#V_[i] <- Y* V_[i+1] + (1-Y)*V_[i-1] + Z_i
V_[i*T/(2^m)+1] <- Y* V_[(i+1)*T/(2^m)+1] + (1-Y)*V_[(i-1)*T/(2^m)+1] + Z_i
}
}
V_ <- V_ + vgc*T_
V_
ts.plot(V_, main="BRIDGE", xlab="Time increment")
Ryan again. I have found another algorithm for bridge sampling, which I tried on my own, but I am not convinced that my answers are correct. I have added my code, output, and the algorithm below, along with what I think the output should look like. I have used a format similar to your code:
set.seed(1)
M <- 7
nt <- 2^M + 1 #number of observations
T <- nt - 1 #total time
T_ <- seq(0, T, length.out=nt) #fixed time increments
sigma=0.008835
theta= -0.003856
nu=0.263743
vgc=0.004132
V_ <- G_ <- rep(1,nt)
G_[T] <- rgamma(1, shape=T/nu, scale=nu)
V_[T] <- rnorm(1, theta*G_[T], sqrt(sigma^2*G_[T]))
V_[1] <- 0
G_[1] <- 0
for (m in 2:M){
Y <- rbeta(1,T/(2^m*nu), T/(2^m*nu))
for (j in 2:2^(m-1)){
i <- (2*j-1)
G_[i*T/(2^m)] = G_[(i-1)*T/(2^m)]+(G_[(i-1)*T/(2^m)]+G_[(i+1)*T/(2^m)])*Y
b=G_[T*(i)/2^m] - G_[T*(i-1)/2^m]
Z_i <- rnorm(1, sd=b*sigma^2*Y)
V_[i] <- Y* V_[i+1] + (1-Y)*V_[i-1] + Z_i
}
}
V_ <- V_ + vgc*T_ # changes due to interest rate
V_
ts.plot(V_, main="BRIDGE", xlab="Time increment")
However, this is the plot from my output, in figure 1:
But as variance gamma is a pure jump process, the path should look like this (this is just an image from Google of variance gamma paths). The sequential sampling one looks like this, and my aim is to compare it to bridge sampling for simulating paths. But my output looks really different. Please let me know your thoughts. If there is an issue in my code, let me know, thanks. Here is the algorithm for it, which is similar to the one above but slightly different:
Searching through Stack Overflow for similar questions, I notice that this has been asked several times but hasn't really been properly answered. Perhaps with help from other users this post can be a helpful guide to programming a numerical estimate of the parameters of a multivariate normal distribution.
I know, I know! The closed form solutions are available and trivial to implement. In my case I am interested in modifying the likelihood function for a specific purpose and I don't expect an exact analytic solution so this is a test case to check the procedure.
So here is my attempt. Please comment. Especially if I am missing opportunities for optimization. Note, I'm not a statistician so I'd appreciate any pointers.
ll_multN <- function(theta,X) {
# theta = c(mu, diag(Sigma), Sigma[upper.tri(Sigma)])
# X is an nxk dataset
# log-likelihood: L = - (nk/2)*log(2*pi) - (n/2)*log(det(Sigma)) - (1/2)*sum_i(t(X_i-mu) %*% Sigma^-1 %*% (X_i-mu))
# summation over i is performed using an apply call for efficiency
n <- nrow(X)
k <- ncol(X)
# def mu
mu.vec <- theta[1:k]
# def Sigma
Sigma.diag <- theta[(k+1):(2*k)]
Sigma.offd <- theta[(2*k+1):length(theta)]
Sigma <- matrix(NA, k, k)
Sigma[upper.tri(Sigma)] <- Sigma.offd
Sigma <- t(Sigma)
Sigma[upper.tri(Sigma)] <- Sigma.offd
diag(Sigma) <- Sigma.diag
# compute summation
sum_i <- sum(apply(X, 1, function(x) (matrix(x,1,k)-mu.vec)%*%solve(Sigma)%*%t(matrix(x,1,k)-mu.vec)))
# compute log likelihood
logl <- -.5*n*k*log(2*pi) - .5*n*log(det(Sigma))
logl <- logl - .5*sum_i
return(-logl)
}
The simulated dataset is generated using the rmvnorm() function in the package "mvtnorm". A random positive definite covariance matrix is generated using the additional function Posdef() (taken from here: https://stat.ethz.ch/pipermail/r-help/2008-February/153708)
library(mvtnorm)
Posdef <- function (n, ev = runif(n, 0, 5)) {
# generates a random positive definite covariance matrix
Z <- matrix(ncol=n, rnorm(n^2))
decomp <- qr(Z)
Q <- qr.Q(decomp)
R <- qr.R(decomp)
d <- diag(R)
ph <- d / abs(d)
O <- Q %*% diag(ph)
Z <- t(O) %*% diag(ev) %*% O
return(Z)
}
set.seed(2)
n <- 1000 # number of data points
k <- 3 # number of variables
mu.tru <- sample(0:3, k, replace=T) # random mean vector
Sigma.tru <- Posdef(k) # random covariance matrix
eigen(Sigma.tru)$val # check positive def (all lambda > 0)
# Generate simulated dataset
X <- rmvnorm(n, mean=mu.tru, sigma=Sigma.tru)
# initial parameter values
pars.init <- c(mu=rep(0,k), sig_ii=rep(1,k), sig_ij=rep(0, k*(k-1)/2))
# limits for optimization algorithm
eps <- .Machine$double.eps # a small value for bounding the parameter space to avoid things such as log(0)
lower.bound <- c(rep(-Inf,k), # bound on mu
rep(eps,k), # bound on sigma_ii
rep(-Inf,k*(k-1)/2)) # bound on sigma_ij i=/=j (there are k*(k-1)/2 off-diagonal terms)
upper.bound <- c(rep(Inf,k), # bound on mu
rep(100,k), # bound on sigma_ii
rep(100,k*(k-1)/2)) # bound on sigma_ij i=/=j
system.time(
o <- optim(pars.init,
ll_multN, X=X, method="L-BFGS-B",
lower = lower.bound,
upper = upper.bound)
)
plot(x=c(mu.tru,diag(Sigma.tru),Sigma.tru[upper.tri(Sigma.tru)]),
y=o$par,
xlab="Parameter",
ylab="Estimate",
pch=20)
abline(c(0,1), col="red", lty=2)
This currently runs on my laptop in
user system elapsed
47.852 24.014 24.611
and gives this graphical output:
Estimated mean and variance
In particular any advice on limit setting or algorithm choice would be much appreciated.
Thanks
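One possible speed-up, sketched under assumptions: most of the run time in a log-likelihood like the one above is likely spent calling solve(Sigma) once per row inside apply. A single Cholesky factorization per evaluation gives both the log-determinant and all quadratic forms at once. The function name `negll_mvn` and the penalty value returned for a non-positive-definite Sigma are hypothetical choices; the parameter layout matches theta = c(mu, diag(Sigma), Sigma[upper.tri(Sigma)]) and assumes k >= 2.

```r
# Negative multivariate normal log-likelihood using one Cholesky factorization.
negll_mvn <- function(theta, X) {
  n <- nrow(X)
  k <- ncol(X)
  mu <- theta[1:k]
  # Rebuild the symmetric covariance matrix from the parameter vector
  Sigma <- diag(theta[(k + 1):(2 * k)], k)
  Sigma[upper.tri(Sigma)] <- theta[(2 * k + 1):length(theta)]
  Sigma[lower.tri(Sigma)] <- t(Sigma)[lower.tri(Sigma)]
  # chol() fails when Sigma is not positive definite; return a large finite
  # penalty in that case (L-BFGS-B in optim requires finite function values)
  R <- tryCatch(chol(Sigma), error = function(e) NULL)
  if (is.null(R)) return(1e10)
  # Solve t(R) %*% Z = t(X) - mu, so that colSums(Z^2) are the quadratic forms
  Z <- backsolve(R, t(X) - mu, transpose = TRUE)
  0.5 * n * k * log(2 * pi) + n * sum(log(diag(R))) + 0.5 * sum(Z^2)
}

# Small self-contained check on simulated data (base R only)
set.seed(1)
k <- 3
n <- 200
A <- matrix(c(2, 0.3, 0.1,
              0.3, 1, 0.2,
              0.1, 0.2, 1.5), k, k)        # a positive definite matrix
X <- matrix(rnorm(n * k), n, k) %*% chol(A)
theta <- c(colMeans(X), diag(A), A[upper.tri(A)])
val <- negll_mvn(theta, X)
```

The function can be passed to optim in place of the original, for example with the same method and bounds. The penalty for invalid covariance matrices is crude; parameterizing Sigma through its Cholesky factor would avoid the problem altogether, at the cost of a different parameter vector.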
We need to solve the following question.
Consider the following binomial probability mass function (pmf):
$f(x; m, p) = \binom{m}{x} p^x (1-p)^{m-x}$, for $x = 0, 1, 2, \ldots, m$,
and otherwise equal to 0. Let $X_1, X_2, \ldots, X_n$ be independent and identically distributed random samples from $f(x; m = 20, p = 0.45)$.
1) Assume $n = 15$ and calculate the 95% confidence interval on $p$ using $\hat{p} = \sum_{i=1}^{n} X_i / (mn)$ (an estimator of $p$). Simulate these confidence intervals 10000 times and count how often the parameter value $p$ lies within these 10000 confidence intervals.
m <- 20
p <- 0.45
n <- 15
x <- m
nsim <- 10000
counter <- 0
for (i in 1:nsim) {
bpmf <- rbinom(x,m,p)
esti_p <- bpmf/(m*n)
var_bpmf <- var(bpmf)
CI_lower <- esti_p - qnorm(0.975)*sqrt(var_bpmf/n)
CI_upper <- esti_p + qnorm(0.975)*sqrt(var_bpmf/n)
if ((CI_lower<p) & (CI_upper>p)) counter <- counter + 1
}
It doesn't work properly and I don't see what I'm doing wrong. Is there anyone who can help me with this?
When I run my code, I believe the answer is now right, but it gives the following message: "There were 50 or more warnings (use warnings() to see the first 50)". When I run warnings(), it gives:
"1: In if ((CI_lower < p) & (CI_upper > p)) counter <- counter + ... :
the condition has length > 1 and only the first element will be used".
Also, I don't know for sure whether
CI_lower <- esti_p - qnorm(0.975)*sqrt(var_bpmf/n)
CI_upper <- esti_p + qnorm(0.975)*sqrt(var_bpmf/n)
is the right formula to calculate the confidence interval.
m <- 20
p <- 0.45
nsim <- 10000
bpmf <- rbinom(size=m,prob=p,n=nsim)
esti_p <- bpmf/m
var_bpmf <- esti_p*(1-esti_p)/m
CI_lower <- esti_p - qnorm(0.975)*sqrt(var_bpmf)
CI_upper <- esti_p + qnorm(0.975)*sqrt(var_bpmf)
counter <-((CI_lower<p) & (CI_upper>p))
table(counter)
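A minimal sketch that keeps the sample size n = 15 from the problem statement, assuming the usual Wald (normal approximation) interval is what is intended. Each simulated sample contributes a single estimate p-hat = sum(X_i)/(m*n) with estimated variance p-hat*(1 - p-hat)/(m*n), so the comparison against p yields one TRUE/FALSE per simulation and no length warning. The variable names `X`, `p_hat`, `se`, `covered`, and `coverage` are illustrative.

```r
m <- 20
p <- 0.45
n <- 15
nsim <- 10000

set.seed(1)
# Each row holds one simulated sample X_1, ..., X_n of binomial(m, p) draws
X <- matrix(rbinom(nsim * n, size = m, prob = p), nrow = nsim)

p_hat <- rowSums(X) / (m * n)                 # one estimate per simulated sample
se    <- sqrt(p_hat * (1 - p_hat) / (m * n))  # Wald standard error
z     <- qnorm(0.975)

covered  <- (p_hat - z * se < p) & (p < p_hat + z * se)
coverage <- mean(covered)                     # proportion of intervals containing p
coverage
```

With m*n = 300 Bernoulli trials behind each estimate, the empirical coverage would be expected to sit close to the nominal 95%.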