Small disclaimer: I considered posting this on Cross Validated, but I feel that it is more closely related to a software implementation. The question can be migrated if you disagree.
I am trying out the package samplesize. I am trying to decipher what the k parameter for the function n.ttest is. The following is stated in the documentation:
k Sample fraction k
This is not very helpful. What exactly is this parameter?
I am performing the following calculations; all the essential values are in the vals variable, which I provide below:
power <- 0.90
alpha <- 0.05
vals <- ??? # These values are provided below
mean.diff <- vals[1,2]-vals[2,2]
sd1 <- vals[1,3]
sd2 <- vals[2,3]
k <- vals[2,4]/(vals[1,4]+vals[2,4])
design <- "unpaired"
fraction <- "unbalanced"
variance <- "equal"
# Get the sample size
n.ttest(power = power, alpha = alpha, mean.diff = mean.diff,
sd1 = sd1, sd2 = sd2, k = k, design = design,
fraction = fraction, variance = variance)
vals contains the following values:
> vals
  affected       mean       sd length
1        1 -0.8007305 7.887657     57
2        2  4.5799913 6.740781     16
Is k the proportion of one group, in the total number of observations? Or is it something else? If I am correct, then does the proportion correspond to group with sd1 or sd2?
Your first instinct was right: this belongs on stats.SE and not on SO. The parameter k has a statistical interpretation which can be found in any reference on power analysis. It sets the size of the second sample relative to the first: in a two-sample test where the second sample is constrained to be a fixed fraction of the first, k is that fraction, i.e. k = n2/n1, not one group's share of the total number of observations.
You can see the relevant lines of the code here (lines 106 to 120 of n.ttest):
unbalanced = {
    df <- n.start - 2
    c <- (mean.diff/sd1) * (sqrt(k)/(1 + k))
    tkrit.alpha <- qt(conf.level, df = df)
    tkrit.beta <- qt(power, df = df)
    n.temp <- ((tkrit.alpha + tkrit.beta)^2)/(c^2)
    while (n.start <= n.temp) {
        n.start <- n.start + 1
        tkrit.alpha <- qt(conf.level, df = n.start -
            2)
        tkrit.beta <- qt(power, df = n.start - 2)
        n.temp <- ((tkrit.alpha + tkrit.beta)^2)/(c^2)
    }
    n1 <- n.start/(1 + k)
    n2 <- k * n1
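The excerpt above can be sketched as a standalone base R function to see how k enters the result. This is only an illustration: n_unbalanced is a hypothetical helper, not part of samplesize, and the starting value n.start = 3 and conf.level = 0.95 (taken here as 1 - alpha) are assumptions, since the excerpt does not show how the package sets them.

```r
# Hypothetical helper mirroring the loop in the excerpt; not part of samplesize.
# Assumptions: n.start = 3 and conf.level = 1 - alpha are NOT taken from the package.
n_unbalanced <- function(mean.diff, sd1, k, power, conf.level, n.start = 3) {
  # Standardised effect, scaled by the allocation ratio k = n2/n1
  effect <- (mean.diff / sd1) * (sqrt(k) / (1 + k))
  # Total sample size implied by the t quantiles at a given n
  required <- function(n) {
    (qt(conf.level, df = n - 2) + qt(power, df = n - 2))^2 / effect^2
  }
  n.temp <- required(n.start)
  while (n.start <= n.temp) {
    n.start <- n.start + 1
    n.temp <- required(n.start)
  }
  n1 <- n.start / (1 + k)   # group 1 gets the share 1/(1 + k) of the total
  c(total = n.start, n1 = n1, n2 = k * n1)
}

res <- n_unbalanced(mean.diff = -0.8007305 - 4.5799913, sd1 = 7.887657,
                    k = 16 / 57, power = 0.90, conf.level = 0.95)
res
res["n2"] / res["n1"]   # recovers k, i.e. the ratio n2/n1
```

Whatever the total turns out to be, the split always satisfies n1 + n2 = total and n2/n1 = k, which is why k has to be passed as a ratio of group sizes.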
In your case:
library(samplesize)
vals = data.frame(
affected = c(1, 2),
mean = c(-0.8007305, 4.5799913),
sd = c(7.887657, 6.740781),
length = c(57, 16))
power <- 0.90
alpha <- 0.05
mean.diff <- vals[1,2]-vals[2,2]
sd1 <- vals[1,3]
sd2 <- vals[2,3]
# k <- vals[2,4]/(vals[1,4]+vals[2,4]) # share of the total: not what n.ttest uses
k <- vals[2,4]/vals[1,4] # ratio n2/n1
design <- "unpaired"
fraction <- "unbalanced"
variance <- "equal"
# Get the sample size
tt1 = n.ttest(power = power,
alpha = alpha,
mean.diff = mean.diff,
sd1 = sd1,
sd2 = sd2,
k = k,
design = design,
fraction = fraction,
variance = variance)
You can see that:
assertthat::are_equal(ceiling(tt1$`Sample size group 1`*tt1$Fraction),
tt1$`Sample size group 2`)
I am implementing the Dirichlet Mixture Model using the EM algorithm in R, but am experiencing issues with the results. I generated two binomial distributions with fractions of (70%, 30%) and means of (0.05, 0.18), and trimmed 5% of the data set near 0. However, I am using a Beta distribution for clustering instead of a binomial distribution. Additionally, I am updating the mean and variance of the distributions rather than the alpha and beta parameters in order to impose constraints on the variance of each distribution.
I expected to obtain results similar to the ground truth settings, but instead I am getting pi values of (1, 0) and means of (0.09, 0.21). I am not sure if there are errors in my EM algorithm implementation or issues with parameter initialization.
I am including my R code for the data generation and DMM below. I would appreciate any help in identifying the cause of the problem and suggestions for how to resolve it.
library(dplyr)
library(data.table)
library(tidyverse)
set.seed(42)
#read count
cover <- 100
#Ground Truth Setting
subclone_f <- c(0.7, 0.3) # Ground Truth Setting - proportion
subclone_vaf <- c(0.05, 0.18) # Ground Truth Setting - mean
n_muts <- 45000
n_clone <-length(subclone_f)
#generating the virtual mutation notation: subclonal if 2, clonal if 1
mut_type <- sample.int(2, n_muts, prob = subclone_f, replace = TRUE)
mut_type
#generating binomial read counts for the given coverage
mut_reads <- rbinom(n_muts, cover, prob = subclone_vaf[mut_type]) %>% data.frame()
mut_reads
vaf <- (mut_reads/cover) %>% data.frame()
# Truncate the low count reads
n <- 0.95 * nrow(vaf) # cut-off setting
vaf_trim <- sapply(vaf, function(x) sort(x, decreasing = TRUE)[1:n])
colnames(vaf_trim) <- c("vaf")
hist(vaf_trim, breaks=seq(0,0.75,by=0.0001))
# Mixture Model
# Parameter Initialization (for 2 subclonality)
pi <- c(0.5, 0.5) # Mixture proportion weight: sums up to 1
alpha <- c(2,3)
beta <- c(20,5)
Mu[1] <- alpha[1] / (alpha[1] + beta[1])
Mu[2] <- alpha[2] / (alpha[2] + beta[2])
var[1] <- alpha[1]*beta[1] / ((alpha[1] + beta[1])^2 * (alpha[1] + beta[1] +1))
var[2] <- alpha[2]*beta[2] / ((alpha[2] + beta[2])^2 * (alpha[2] + beta[2] +1))
tau <-c(0.05, 0.05)
loglike[1] <- 0.5
loglike[2] <- 0.5
k <- 2
Nu <- 1/ (alpha + beta + 1) # controls the variance; originally I wanted the same Nu for both distributions, but I don't know how to do that
n_cluster <- nrow(data.frame(pi))
logdbeta <- function(x, alpha, beta) {
sum(sapply(x, function(x) {dbeta(x, alpha, beta, log = TRUE)}))
}
estBetaParams <- function(mu, var) {
alpha <- ((1 - mu) / var - (1 / mu)) * mu ^ 2
beta <- alpha * (1 / mu - 1)
return(params = list(alpha = alpha, beta = beta))
}
# Loop for the EM algorithm
while(abs(loglike[k]-loglike[k-1]) >= 0.00001) {
# E step
total <- (pi[1]*dbeta(vaf_trim, alpha[1], beta[1])) + (pi[2]*dbeta(vaf_trim, alpha[2], beta[2]))
tau1 <- pi[1]*(dbeta(vaf_trim, alpha[1], beta[1]))/ total
tau2 <- pi[2]*(dbeta(vaf_trim, alpha[2], beta[2]))/ total
# M step
pi[1] <- sum(tau1)/length(vaf_trim) # Update Pi(weight)
pi[2] <- sum(tau2)/length(vaf_trim)
Mu[1] <- sum(tau1*vaf_trim)/sum(tau1) # Update Mu
Mu[2] <- sum(tau2*vaf_trim)/sum(tau2)
#Nu <- alpha + beta
Nu <- 1/ (alpha + beta + 1)
# Our main aim was to share the same coefficient for all dist
var[1] <- Mu[1] * (1-Mu[1]) * Nu[1] # Update Variance
var[2] <- Mu[2] * (1-Mu[2]) * Nu[2]
#Update in terms of alpha and beta
estBetaParams(Mu[1], var[1])
estBetaParams(Mu[2], var[2])
# Maximize the loglikelihood
loglike[k+1]<-sum(tau1*(log(pi[1])+logdbeta(vaf_trim,Mu[1],var[1])))+sum(tau2*(log(pi[2])+logdbeta(vaf_trim,Mu[2],var[2])))
k<-k+1
}
# Print estimates
EM <- data.table(param = c("pi", "Mean"), pi = pi, Mean = Mu)
knitr::kable(EM)
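The conversion between (alpha, beta) and (mean, variance) that the code above relies on can be checked in isolation with a round trip. This is a minimal sketch using the standard Beta moment formulas; beta_to_moments and moments_to_beta are hypothetical helper names introduced here for illustration.

```r
# Hypothetical helpers for illustration; they use the standard Beta moments.
beta_to_moments <- function(alpha, beta) {
  mu <- alpha / (alpha + beta)
  v  <- alpha * beta / ((alpha + beta)^2 * (alpha + beta + 1))
  list(mu = mu, var = v)
}

moments_to_beta <- function(mu, var) {
  nu <- mu * (1 - mu) / var - 1        # equals alpha + beta
  list(alpha = mu * nu, beta = (1 - mu) * nu)
}

# Round trip with the initial values from the question
m <- beta_to_moments(alpha = c(2, 3), beta = c(20, 5))
p <- moments_to_beta(m$mu, m$var)
p$alpha   # should recover c(2, 3)
p$beta    # should recover c(20, 5)
```

One thing worth checking in the loop above: a conversion like this only affects the next E step if its result is assigned back, for example ab <- moments_to_beta(Mu[1], var[1]); alpha[1] <- ab$alpha; beta[1] <- ab$beta. As written, the values returned by estBetaParams inside the loop are not stored anywhere.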
I am trying to figure out how to sample from a custom density in rJAGS but am running into issues. Having searched the site, I saw that there is a zeros (or ones) trick that can be employed based on BUGS code, but I am having a hard time implementing it in rJAGS. I think I am doing it correctly but keep getting the following error:
Error in jags.model(model1.spec, data = list(x = x, N = N), n.chains = 4, :
Error in node dpois(lambda)
Length mismatch in Node::setValue
Here is my rJAGS code for reproducibility:
library(rjags)
set.seed(4)
N = 100
x = rexp(N, 3)
L = quantile(x, prob = 1) # Censoring point
censor = ifelse(x <= L, 1, 0) # Censoring indicator
x[censor == 1] <- L
model1.string <-"
model {
for (i in 1:N){
x[i] ~ dpois(lambda)
lambda <- -N*log(1-exp(-(1/mu)))
}
mu ~ dlnorm(mup, taup)
mup <- log(.0001)
taup <- 1/49
R <- 1 - exp(-(1/mu) * .0001)
}
"
model1.spec<-textConnection(model1.string)
jags <- jags.model(model1.spec,
data = list('x' = x,
'N' = N),
n.chains=4,
n.adapt=100)
Here, my negative log likelihood of the density I am interested in is -N*log(1-exp(-(1/mu))). Is there an obvious mistake in the code?
Using the zeros trick, the variable on the left-hand side of the dpois() relationship has to be an N-length vector of zeros. The variable x should show up in the likelihood somewhere. Here is an example using the normal distribution.
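The identity behind the zeros trick can be checked in plain R before writing any JAGS code. A minimal sketch, assuming a unit-variance normal likelihood and a constant C = 10 chosen only to keep the Poisson mean positive: observing a zero from a Poisson with mean phi has probability exp(-phi), so setting phi to minus the log-likelihood plus C reproduces the likelihood up to the constant factor exp(-C).

```r
set.seed(519)
x <- rnorm(5, mean = 3)
C <- 10                                    # assumed constant; must keep phi > 0

loglik <- dnorm(x, mean = 3, sd = 1, log = TRUE)   # log-likelihood of each x[i]
phi    <- -loglik + C                      # Poisson mean used by the zeros trick
stopifnot(all(phi > 0))

# Probability of observing a zero from Poisson(phi)
contribution <- dpois(0, lambda = phi)

# Same as the original likelihood times the constant exp(-C)
all.equal(contribution, exp(loglik) * exp(-C))
```

Because exp(-C) does not depend on the parameters, it cancels in the posterior, which is why the choice of C does not matter as long as every phi stays positive.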
set.seed(519)
N <- 100
x <- rnorm(100, mean=3)
z <- rep(0, N)
C <- 10
pi <- pi
model1.string <-"
model {
for (i in 1:N){
lambda[i] <- pow(2*pi*sig2, -0.5) * exp(-.5*pow(x[i]-mu, 2)/sig2)
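# zeros trick: the Poisson mean is minus the log-likelihood plus a constant C that keeps it positive
loglam[i] <- -log(lambda[i]) + C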
z[i] ~ dpois(loglam[i])
}
mu ~ dnorm(0,.1)
tau ~ dgamma(1,.1)
sig2 <- pow(tau, -1)
sumLL <- sum(log(lambda[]))
}
"
model1.spec<-textConnection(model1.string)
set.seed(519)
jags <- jags.model(model1.spec,
data = list('x' = x,
'z' = z,
'N' = N,
'C' = C,
'pi' = pi),
inits = function()list(tau = 1, mu = 3),
n.chains=4,
n.adapt=100)
samps1 <- coda.samples(jags, c("mu", "sig2"), n.iter=1000)
summary(samps1)
Iterations = 101:1100
Thinning interval = 1
Number of chains = 4
Sample size per chain = 1000

1. Empirical mean and standard deviation for each variable,
   plus standard error of the mean:

      Mean     SD Naive SE Time-series SE
mu   4.493 2.1566 0.034100         0.1821
sig2 1.490 0.5635 0.008909         0.1144

2. Quantiles for each variable:

       2.5%   25%   50%   75% 97.5%
mu   0.6709 3.541 5.218 5.993 7.197
sig2 0.7909 0.999 1.357 1.850 2.779
I need to compare the variances of several independent samples. I don't have the data stored in vectors. I only know the mean, standard deviation and sample size of each sample. Does anyone know a way to test whether the variances are equal with only those three statistics in R?
Here is an implementation of the Bartlett test that doesn't require the samples themselves, only their sizes and standard deviations or variances.
The arguments are:
n: a vector of sample sizes;
S: a vector of standard deviations or variances;
se: a logical value; if TRUE, S holds the standard deviations, if FALSE the variances.
Tested below with data set iris.
Bartlett_test <- function(n, S, se = TRUE){
dname <- deparse(substitute(S))
N <- sum(n)
k <- length(n)
S2 <- if(se) S^2 else S
S2p <- sum((n - 1)* S2)/(N - k)
numer <- (N - k)*log(S2p) - sum((n - 1)*log(S2))
denom <- 1 + (sum(1/(n - 1)) - 1/(N - k))/(3*(k - 1))
statistic <- c(X2 = numer/denom)
parameter <- k - 1
p.value <- pchisq(statistic, df = parameter, lower.tail = FALSE)
ht <- list(
statistic = statistic,
data.name = dname,
parameter = parameter,
p.value = p.value,
method = "Bartlett test of homogeneity of variances",
alternative = "there are at least two unequal variances"
)
class(ht) <- "htest"
ht
}
n <- with(iris, tapply(Sepal.Length, Species, FUN = length))
s <- with(iris, tapply(Sepal.Length, Species, FUN = sd))
s2 <- with(iris, tapply(Sepal.Length, Species, FUN = var))
Bartlett_test(n, s)
Bartlett_test(n, s2, se = FALSE)
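As a sanity check, the statistic computed from summary statistics alone can be compared with stats::bartlett.test run on the raw iris data. The sketch below repeats the arithmetic of the function inline so that it runs on its own; the variable names are chosen here for illustration.

```r
# Summary statistics only: group sizes and variances
n  <- tapply(iris$Sepal.Length, iris$Species, length)
s2 <- tapply(iris$Sepal.Length, iris$Species, var)

N <- sum(n)
k <- length(n)
s2p   <- sum((n - 1) * s2) / (N - k)                       # pooled variance
numer <- (N - k) * log(s2p) - sum((n - 1) * log(s2))
denom <- 1 + (sum(1 / (n - 1)) - 1 / (N - k)) / (3 * (k - 1))
stat_summary <- numer / denom
p_summary    <- pchisq(stat_summary, df = k - 1, lower.tail = FALSE)

# Reference computed from the raw data
ref <- bartlett.test(Sepal.Length ~ Species, data = iris)

all.equal(stat_summary, unname(ref$statistic))
all.equal(p_summary, ref$p.value)
```

Both comparisons should return TRUE, since the Bartlett statistic depends on the data only through the group sizes and variances.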
I want to simulate time series data that follow an AR(1) process with phi = 0.6. After the first simulation, I check whether the series is identified as AR(1). If not, I run a second trial and average it with the first to form the series. I keep adding one trial at a time and re-testing the order until the average of the trials is identified as an AR(1) series.
After that, I check whether the AR(1) coefficient equals phi = 0.6. If not, I again add one trial at a time until the coefficient equals phi = 0.6.
**MWE**
library(FitAR)
n=50
a=0.6
count=0
e <- rnorm(n+100)
x <- double(n+100)
x[1] <- rnorm(1)
for(i in 2:(n+100)) {
x[i] <- a * x[i-1] + e[i]
}
x <- ts(x[-(1:100)])
p=SelectModel(x, lag.max = 14, Criterion = "BIC", Best=1)
if(p >= 2){
count <- count + 1
mat <- replicate(count, x)
x <- as.ts(rowMeans(mat))
}
fit=arima(x,order = c(p,0,0))
my_coef=fit$coef
if(my_coef != 0.6){
mat <- replicate(count + 1, x)
x <- as.ts(rowMeans(mat))
}
my_coefficients=my_coef[!names(my_coef) == 'intercept']
print(my_coefficients)
print(paste0("AR(2) model count is: ", count_coef))
We can generate time series data for 100 time points of a white noise ARIMA(0, 0, 0) process with zero mean and standard deviation sd = 2 in the following way
set.seed(2020)
ts <- arima.sim(model = list(), n = 100, sd = 2)
This is explained in the documentation ?arima.sim
Usage:
     arima.sim(model, n, rand.gen = rnorm, innov = rand.gen(n, ...),
               n.start = NA, start.innov = rand.gen(n.start, ...),
               ...)

     ...: additional arguments for ‘rand.gen’. Most usefully, the
          standard deviation of the innovations generated by ‘rnorm’
          can be specified by ‘sd’.
To generate 50 time series, we can use replicate
set.seed(2020)
mat <- replicate(50, arima.sim(model = list(), n = 100, sd = 2))
The resulting object is a matrix with dimensions 100 x 50.
We can confirm that the standard deviation is indeed sd = 2
summary(apply(mat, 2, sd))
#  Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
# 1.669   1.899   2.004   2.006   2.107   2.348
library(tidyverse)
apply(mat, 2, sd) %>%
enframe() %>%
ggplot(aes(value)) +
geom_histogram(bins = 10)
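The same arima.sim call extends to the AR(1) case with phi = 0.6 asked about in the question. A minimal sketch, assuming a series length of 500 and a rough 95% interval based on the standard error reported by arima; both choices are illustrative and not taken from the question.

```r
set.seed(2020)
phi <- 0.6

# Simulate an AR(1) series; arima.sim discards a burn-in period itself
x <- arima.sim(model = list(ar = phi), n = 500)

# Fit an AR(1) model and extract the estimated coefficient
fit     <- arima(x, order = c(1, 0, 0))
phi_hat <- unname(coef(fit)["ar1"])
se_hat  <- sqrt(fit$var.coef["ar1", "ar1"])

# Approximate 95% interval for phi
ci <- phi_hat + c(-1, 1) * 1.96 * se_hat
phi_hat
ci
```

An estimated coefficient is a floating-point number and will essentially never equal 0.6 exactly, so a test such as my_coef != 0.6 is almost always TRUE. Checking whether 0.6 lies inside an interval like ci is one possible alternative.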
I have written a small function that simulates data from a normal distribution, as is usual in linear models. My question is how to get a model in which the p-value of sim[, 1] equals 0.05. I want to show that if I add a random variable, even one that is normally distributed around zero with small variance, N(0, 0.0023), the p-value of sim[, 1] changes. The code below shows the true model.
set.seed(37) # seed for reproducibility
simulation <- function(b_0, b_1, n, min_x_1, max_x_1, sd_e){
  x_1 <- runif(n = n, min = min_x_1, max = max_x_1)
  error <- rnorm(mean = 0, sd = sd_e, n = n)
  y <- b_0 + b_1*x_1 + error
  mat <- matrix(cbind(x_1, y), ncol = 2)
  #plot(mat[,1],mat[,2])
  return(mat)
}
sim <- simulation(10,-2,10000,-10,70,0.003)
summary(lm(sim[,2] ~ sim[,1] ))
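To compare p-values between the true model and a model with an extra noise regressor, the p-value can be read from the coefficient table of summary(lm(...)). The sketch below is only illustrative: the sample size, the weak slope and the error standard deviation are hypothetical values chosen so that the p-value is not vanishingly small, and 0.0023 is assumed to be the variance of the added variable.

```r
set.seed(37)
n   <- 50                                   # hypothetical, smaller than in the question
x_1 <- runif(n, min = -10, max = 70)
y   <- 10 - 0.02 * x_1 + rnorm(n, sd = 2)   # hypothetical weak slope and noisy errors

# Extra regressor unrelated to y; assumes 0.0023 is a variance
noise <- rnorm(n, mean = 0, sd = sqrt(0.0023))

fit_true  <- lm(y ~ x_1)
fit_noise <- lm(y ~ x_1 + noise)

# p-value of the x_1 coefficient in each model
p_true  <- coef(summary(fit_true))["x_1", "Pr(>|t|)"]
p_noise <- coef(summary(fit_noise))["x_1", "Pr(>|t|)"]
c(true_model = p_true, with_noise = p_noise)
```

With the settings in the question (n = 10000, sd_e = 0.003) the slope is estimated so precisely that its p-value is effectively zero, so any change caused by the extra variable would be hard to see; a smaller sample or noisier errors makes the comparison visible.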