pwr.chisq.test error in R

I am trying to estimate the sample size needed for A/B testing of a website conversion rate. pwr.chisq.test always gives me an error message when I have small values for the conversion rates:
library(pwr)

# conversion rates for the two groups
p1 = 0.001
p2 = 0.0011
# degrees of freedom
df = 1
# effect size
w = ES.w1(p1, p2)
pwr.chisq.test(w,
               df = 1,
               power = 0.8,
               sig.level = 0.05)
Error in uniroot(function(N) eval(p.body) - power, c(1 + 1e-10, 1e+05)) :
  f() values at end points not of opposite sign
However, if I use larger values for p1 and p2, the code works fine.
# conversion rates for the two groups
p1 = 0.01
p2 = 0.011
# degrees of freedom
df = 1
# effect size
w = ES.w1(p1, p2)
pwr.chisq.test(w,
               df = 1,
               power = 0.8,
               sig.level = 0.05)
     Chi squared power calculation

              w = 0.01
              N = 78488.61
             df = 1
      sig.level = 0.05
          power = 0.8

NOTE: N is the number of observations

I think there is a "numerical" explanation for this. If you look at the function's code, you can see that the number of observations is computed by uniroot and is assumed to lie in an interval whose endpoints are set to 1 + 1e-10 and 1e5. The error message says that the root does not lie within this interval: in your case, the upper limit is too small.
Knowing that, we can simply take a wider interval:
w <- 0.00316227766016838                  # the effect size, ES.w1(0.001, 0.0011)
k <- qchisq(0.05, df = 1, lower = FALSE)  # critical value of the chi-squared test
p.body <- quote(pchisq(k, df = 1, ncp = N * w^2, lower = FALSE))  # power as a function of N
N <- uniroot(function(N) eval(p.body) - 0.8, c(1 + 1e-10, 1e+7))$root
The "solution" is N=784886.1... that's a huge number of observations.

Related

Can I plot the ANOVA model for a two or three factor experiment?

I want to fit a model to a three-factor factorial experiment. In an attempt to do this with R, I am reproducing examples from a textbook (Montgomery, DC (2013) Design and Analysis of Experiments, 8th ed., John Wiley & Sons, ISBN: 9781118097939). The specific example I am attempting is Example 5.5; although it is only a two-factor example, I am hoping to learn the basics from it.
I can easily reproduce the ANOVA table in R, and I can extract the coefficients of the model (I think). Considering the model equation given in the image above, I assume that the four coefficients returned by R are β0, β1, β2 and β12. I have no idea how to plot the surface described by the model, which is my first problem. Secondly, the textbook discusses how a better model fit can be attained if the interaction parameters, i.e. β112, β122 and β1122, are included. Is it possible to do this in R as well? The surface fitted to the model including the interaction parameters is attached here.
I am relatively comfortable in Python, although I have never plotted surfaces using matplotlib. I am very new to R and have never plotted anything in it. From searching the web I could not find anything useful for what I am trying to do. My code is attached below.
lewensduur_data <- data.frame(A = rep(c(15, 20, 25), each = 2),
                              B = rep(c(125, 150, 175), each = 6),
                              lewe = c(-2, -1, 0, 2, -1, 0,
                                       -3, 0, 1, 3, 5, 6,
                                       2, 3, 4, 6, 0, -1))
lewensduur_anova <- aov(lewe ~ A * B, data = lewensduur_data)
lewensduur_anova
which yields the ANOVA table
Call:
aov(formula = lewe ~ A * B, data = lewensduur_data)
Terms:
                       A        B      A:B Residuals
Sum of Squares   8.33333 21.33333  8.00000  86.33333
Deg. of Freedom        1        1        1        14
Residual standard error: 2.483277
Estimated effects may be unbalanced
I retrieved the coefficients as follows
coefficients(lewensduur_anova)
yielding
(Intercept)           A           B         A:B
-34.0000000   1.3666667   0.2133333  -0.0080000
As an afterthought, I noticed that aov() reports that the estimated effects may be unbalanced. From what I understand, aov() is best suited to factors having the same number of levels and replicates. Is there a better ANOVA function to use for cases like my example?
There are many ways you can make this happen. I'm going to start with what you've done so far, though.
Your aov() output shows 1 degree of freedom for A and 1 for B. That's a sign that something is wrong: the degrees of freedom for a categorical factor is the number of unique values minus one, so both A and B should have two degrees of freedom.
Let's go over what went wrong: your entries for A and B are numbers, so aov() treated them as continuous variables rather than factors. The easiest way to fix this is to make these two columns factor-type.
library(dplyr)

av <- aov(lewe ~ A * B,
          data = mutate(lewensduur_data, A = as.factor(A), B = as.factor(B)))
summary(av)
# Df Sum Sq Mean Sq F value Pr(>F)
# A 2 24.33 12.167 8.423 0.00868 **
# B 2 25.33 12.667 8.769 0.00770 **
# A:B 4 61.33 15.333 10.615 0.00184 **
# Residuals 9 13.00 1.444
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
These numbers are quite a bit different from your original ones. You still need to ask yourself whether this information is meaningful. ANOVA requires that the data meet the assumptions of normality and homogeneity of variance. Can you really assess normality and homogeneity with only two observations in each cell? Your results are not going to be meaningful.
This is not enough data to evaluate with ANOVA.
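(As a side note, and not part of the original answer, both assumptions can also be checked in base R; a minimal sketch using the model av fitted above:)
# base-R versions of the two assumption checks (sketch only)
shapiro.test(residuals(av))                                      # Shapiro-Wilk normality test on the residuals
bartlett.test(lewe ~ interaction(A, B), data = lewensduur_data)  # equal variances across the nine cells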
However, for the sake of your original questions, let's look at this data using the jmv package. Note that I didn't make A or B factors for this call. Additionally, I used the type-2 sum of squares method (this is actually irrelevant here, though).
library(jmv)

an <- ANOVA(formula = lewe ~ A * B,
            data = lewensduur_data, ss = "2", effectSize = "partEta",
            homo = T, norm = T, postHocCorr = 'tukey', postHoc = ~ A * B)
an # fails for homogeneity; passes for normality
#
# ANOVA
#
# ANOVA - lewe
# ───────────────────────────────────────────────────────────────────────────────────────────
# Sum of Squares df Mean Square F p η²p
# ───────────────────────────────────────────────────────────────────────────────────────────
# A 24.33333 2 12.166667 8.423077 0.0086758 0.6517857
# B 25.33333 2 12.666667 8.769231 0.0077028 0.6608696
# A:B 61.33333 4 15.333333 10.615385 0.0018438 0.8251121
# Residuals 13.00000 9 1.444444
# ───────────────────────────────────────────────────────────────────────────────────────────
#
#
# ASSUMPTION CHECKS
#
# Homogeneity of Variances Test (Levene's)
# ────────────────────────────────────────────
# F df1 df2 p
# ────────────────────────────────────────────
# 1.163368e+31 8 9 < .0000001
# ────────────────────────────────────────────
#
#
# Normality Test (Shapiro-Wilk)
# ─────────────────────────────
# Statistic p
# ─────────────────────────────
# 0.9209304 0.1342718
# ─────────────────────────────
Even if we ignore the insufficient sample size, we still can't trust this information. ANOVA is quite robust against deviations from normality, but it is very sensitive to violations of homogeneity of variance. (Your homogeneity check was never going to pass, given the sample size relative to the number of groups.)
To show you how you can plot this, I used the plotly package. To use this package for a surface plot, you need a matrix that essentially looks like the table in your book, where the speeds are the column names and the angles are the row names. You can only have one entry for each row/column combination, so I chose to use the average of the values by group. To collect the averages I used lapply.
# get the average by speed and angle
new_lewe <- lapply(seq(1, (nrow(lewensduur_data) - 1), by = 2),
                   function(i) {
                     with(lewensduur_data[i:(i + 1), ], mean(lewe))
                   }) %>% unlist()
I placed these averages into a matrix and gave it row and column names.
ld <- matrix(data = new_lewe,
             nrow = 3, ncol = 3,
             dimnames = with(lewensduur_data, list(unique(A), unique(B))))
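(An equivalent, more compact way to build the same matrix of cell means, shown here only as a sketch and not part of the original answer, is tapply():)
# same matrix of group means via tapply (rows = A, columns = B)
ld2 <- with(lewensduur_data, tapply(lewe, list(A, B), mean))
all.equal(unname(ld), unname(ld2))  # should be TRUE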
You can see a basic surface plot now.
library(plotly)

plot_ly(x = colnames(ld), y = rownames(ld),
        z = ld) %>% add_surface()
You can dress this surface plot up to look more like your book's surface plot, as well.
plot_ly(x = colnames(ld), y = rownames(ld),
z = ld) %>%
add_surface(contours = list(z = list(usecolormap = T, show = T,
project = list(z = T)))) %>%
layout(scene = list(aspectmode = "manual",
aspectratio = list(x = 2, y = 1.5, z = 1),
xaxis = list(title = "Cutting Speed"),
yaxis = list(title = "Tool Angle"),
zaxis = list(title = "Life"),
camera = list(
center = list(x = .5, y = .5, z = .4),
eye = list(x = 1.5, y = 2, z = 2))))
You can project this data to a basic contour plot, as well.
plot_ly(x = colnames(ld), y = rownames(ld),
z = ld, type = "contour", line = list(width = 0))
In that basic contour plot (above) the x & y are opposite of those in your book.
In the next contour plot, I've flipped the axes, so they match. You can dress it up, as well.
plot_ly(y = colnames(ld), x = rownames(ld),
z = t(ld), type = "contour", ncontours = 8,
line = list(width = 0, smoothing = 1.3),
contours = list(showlabels = T)) %>%
layout(xaxis = list(title = "Angle"),
yaxis = list(title = "Speed"))
If you have any questions, let me know!

Simulating an AR(2) Process in R

I need to simulate an AR(2) process
Y[t] = 1/20 + (sqrt(3)/2) * Y[t-1] - (1/4) * Y[t-2] + e[t],   e[t] ~ (0, 0.02^2)
The simulation has to cover 30 years, with the model measured in quarters (120 observations).
I've tried with x <- arima.sim(model = list(order = c(2, 0, 0), ar = c(a1, a2)), n = 120, n.start = 100, sd = 0.02)
Using the above, R says the model isn't stationary. Here a1 and a2 correspond to phi1 and phi2 in the model, but I can't figure out how to add phi0, or how to set the required starting values y0 = 0.1 and y-1 = 0.12.
I've also tried the following
set.seed(9029) # set a seed to fix the simulated numbers
nsim = 1 # no. of simulations
burn = 100 # burn-in periods
n = 220 # sample length + burn-in periods --> sample length = 4quarters*30yrs
tp=(burn+1):n # time points to be sampled
sigerr = 0.02 # error s.d.
a1 = (sqrt(3)/2) # AR(2) coefficient
a2 = 0.25 # AR(2) coefficient = 1/4
a0 = 1/20 # Phi 0
# create data series and error series
y = array(0,c(n,nsim)) # data series
err = array(rnorm(n*nsim,0,sigerr),c(n,nsim)) # iid errors
# simulate y from an AR(2) process
for (k in 1:nsim) {
for (i in 2:n) {
y[i,k] = a0 + a1*y[i-1,k] + a2*y[i-2,k] + err[i,k]
}
}
But I keep getting "replacement has length zero" as an error, and I still can't figure out how to set the starting values y0 = 0.1 and y-1 = 0.12. Please help, I can't seem to find a fix. Thanks.
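(No answer is included here. As a sketch of one possible fix, and not an accepted solution: the error occurs because the loop starts at i = 2, so y[i-2, k] indexes row 0, which has length zero; also note that the model's coefficient on Y[t-2] is -(1/4), not +1/4. One way to seed the starting values, reusing the objects defined above:)
# sketch only: seed the first two rows with the pre-sample values and start at i = 3
a2 <- -1/4                       # the model has -(1/4) * Y[t-2]
for (k in 1:nsim) {
  y[1, k] <- 0.12                # Y[-1] (one possible placement of the pre-sample values)
  y[2, k] <- 0.10                # Y[0]
  for (i in 3:n) {
    y[i, k] <- a0 + a1 * y[i - 1, k] + a2 * y[i - 2, k] + err[i, k]
  }
}
y_sample <- y[tp, ]              # keep only the post-burn-in observations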

How to Determine the Confidence Interval for a WEIGHTED Population Proportion with R

TASK: COMPUTE THE WEIGHTED SHARES OF MANIFESTATIONS OF A CATEGORICAL VARIABLE AND FIND CONFIDENCE INTERVALS AROUND THOSE WEIGHTED SHARES
library(dplyr)
set.seed(100)
Make up data set with a categorical variable and a weight variable:
df <- data.frame(
Category = rep(c("A", "B", "C", "D"), times = seq(50, 200, length.out = 4)),
Weight = sample(c(1, 1/2, 1/3, 1/4, 1/5), 500, prob = c(0.1, 0.2, 0.4, 0.2, 0.1), replace = TRUE)
)
Have a quick look at the data
head(df, 10)
tail(df, 10)
Now, I CAN complete the task without taking the weights into account:
Write function that returns the UNWEIGHTED share of one manifestation of the categorical variable together with its 95% confidence interval
(for general information regarding the determination of confidence intervals for a population proportion, see e.g.:
https://www.dummies.com/education/math/statistics/how-to-determine-the-confidence-interval-for-a-population-proportion/)
ci.share <- function(category, manifestation){
n = length(category)
share = length(which(category == manifestation)) / n
se = sqrt (share*(1-share) / n)
if(n*share*(1-share) >= 9){
U <- share - 1.96*se
O <- share + 1.96*se
KI <- c(U, share, O)
names(KI) <- c("lower boundary", "share", "upper boundary")
KI = KI * 100
return(KI)
} else {return("Error, because n*p*(1-p) < 9")}
}
Utilize function for all manifestations and store results in list:
cis <- list()
for(i in c("A", "B", "C", "D")){
cis[[i]] <- ci.share(category = df$Category, manifestation = i)
}
Make results easily readable:
cis <- t(sapply(cis, "["))
cis <- round(cis, digits = 2)
cis #TASK DONE
THE QUESTION IS: HOW TO GET THE EQUIVALENT FOR "cis" CONSIDERING THE WEIGHTS
What I can do, is finding the weighted shares:
ws <- summarise_at(group_by(df, Category), vars(Weight), funs(sum))
ws[,2] <- (ws[,2]/sum(df$Weight)) * 100
names(ws) <- c("Category", "Weighted_Share")
sum(ws$Weighted_Share) # Should be 100, is 100
ws
But how to get the Confidence Intervals now?
Would be very grateful for a solution. Thanks in advance!
Andi
For comparison with the survey-based calculation of the proportion confidence intervals (CIs), here I post code to compute the CIs using the Central Limit Theorem (CLT) for non-identically distributed variables (as requested by the original poster (OP) in the comments):
1) Helper function definitions
#' Compute survey-based Confidence Intervals
#'
#' @param df data frame with at least one column: "Category".
#' @param design survey design, normally created with \code{svydesign()} of the survey package.
#' @details The confidence interval for the proportion of each "Category" present in \code{df}
#' is computed using the \code{svymean()} function of the survey package on the given \code{design}.
#' @return data frame containing estimated proportions for each "Category" and respective
#' 95% confidence intervals.
#'
#' ASSUMPTION: the data do not have any missing values
ci_survey <- function(df, design) {
design.mean = svymean(~Category, design, deff="replace")
CI = confint( design.mean )
# Add the estimated proportions and Delta = Semi-size CI for comparison with the CLT-based results below
CI = cbind( p=design.mean, Delta=( CI[,"97.5 %"] - CI[,"2.5 %"] ) / 2, CI )
return(as.data.frame(CI))
}
#' Compute CLT-based Confidence Intervals
#'
#' @param df data frame with at least two columns: "Category", "Weight" (sampling weights)
#' @param ci_level confidence level for the intervals (e.g. 0.95)
#' @details The confidence interval for the proportion of each "Category" present in \code{df}
#' is computed using Lyapunov's or Lindeberg's version of the Central Limit Theorem (CLT), which
#' applies to independent but NOT identically distributed random variables.
#' Ref: https://en.wikipedia.org/wiki/Central_limit_theorem#Lack_of_identical_distribution
#' @return data frame containing estimated proportions for each "Category" and respective
#' \code{ci_level} confidence intervals.
#'
#' ASSUMPTION: the data do not have any missing values
ci_clt <- function(df, ci_level) {
### The following calculations are based on the following setup for each category given in 'df':
### 1) Let X(i) be the measurement of variable X for sampled case i, i = 1 ... n (n=500 in this case)
### where X is a 0/1 variable indicating absence or presence of a selected category.
### From the X(i) samples we would like to estimate the
### true proportion p of the presence of the category in the population.
### Therefore X(i) are iid random variables with Binomial(1,p) distribution
###
### 2) Let Y(i) = w(i)*X(i)
### where w(i) is the sampling weight applied to variable X(i).
###
### We apply the CLT to the sum of the Y(i)'s, using:
### - E(Y(i)) = mu(i) = w(i) * E(X(i)) = w(i) * p (since w(i) is a constant and the X(i) are identically distributed)
### - Var(Y(i)) = sigma2(i) = w(i)^2 * Var(X(i)) = w(i)^2 * p*(1-p) (since the X(i) iid)
###
### Hence, by CLT:
### Sum{Y(i) - mu(i)} / sigma -> N(0,1)
### where:
### sigma = sqrt( Sum{ sigma2(i) } ) = sqrt( Sum{ w(i)^2 } ) * sqrt( p*(1-p) )
### and note that:
### Sum{ mu(i) } = Sum{ w(i) } * p = n*p
### since the sampling weights are assumed to sum up to the sample size.
###
### Note: all the Sums are from i = 1, ..., n
###
### 3) Compute the approximate confidence interval for p based on the N(0,1) distribution
### in the usual way, by first estimating sigma replacing p for the estimated p.
###
alpha = 1 - ci_level # area outside the confidence band
z = qnorm(1 - alpha/2) # critical z-quantile from Normal(0,1)
n = nrow(df) # Sample size (assuming no missing values)
ws = df$Weight / sum(df$Weight) * n # Weights scaled to sum the sample size (assumed for sampling weights)
S = aggregate(ws ~ Category, sum, data=df) # Weighted-base estimate of the total by category (Sum{ Y(i) })
sigma2 = sum( ws^2 ) # Sum of squared weights (note that we must NOT sum by category)
S[,"p"] = S[,"ws"] / n # Estimated proportion by category
S[,"Delta"] = z * sqrt( sigma2 ) *
sqrt( S$p * (1 - S$p) ) / n # Semi-size of the CI by category
LB_name = paste(formatC(alpha/2*100, format="g"), "%") # Name for the CI's Lower Bound column
UB_name = paste(formatC((1 - alpha/2)*100, format="g"), "%") # Name for the CI's Upper Bound column
S[,LB_name] = S[,"p"] - S[,"Delta"] # CI's Lower Bound
S[,UB_name] = S[,"p"] + S[,"Delta"] # CI's Upper Bound
return(S)
}
#' Show the CI with the specified number of significant digits
show_values <- function(values, digits=3) {
op = options(digits=digits)
print(values)
options(op)
}
2) Simulated Data
### Simulated data
set.seed(100)
df <- data.frame(
Category = rep(c("A", "B", "C", "D"), times = seq(50, 200, length.out = 4)),
Weight = sample(c(1, 1/2, 1/3, 1/4, 1/5), 500, prob = c(0.1, 0.2, 0.4, 0.2, 0.1), replace = TRUE)
)
3) Survey-based CI (for reference)
# Computation of CI using survey sampling theory (implemented in the survey package)
library(survey)
design <- svydesign(ids = ~1, weights = ~Weight, data = df)
CI_survey = ci_survey(df, design)
show_values(CI_survey)
which gives:
p Delta 2.5 % 97.5 %
CategoryA 0.0896 0.0268 0.0628 0.116
CategoryB 0.2119 0.0417 0.1702 0.254
CategoryC 0.2824 0.0445 0.2379 0.327
CategoryD 0.4161 0.0497 0.3664 0.466
4) CLT-based CI
The description of the method used is included in the comments at the top of the ci_clt() function defined above.
# Computation of CI using the Central Limit Theorem for non-identically distributed variables
CI_clt = ci_clt(df, ci_level=0.95)
show_values(CI_clt)
which gives:
Category ws p Delta 2.5 % 97.5 %
1 A 44.8 0.0896 0.0286 0.061 0.118
2 B 106.0 0.2119 0.0410 0.171 0.253
3 C 141.2 0.2824 0.0451 0.237 0.328
4 D 208.0 0.4161 0.0494 0.367 0.465
5) Comparison of CI sizes
Here we compute the ratio between the CLT-based CIs and the survey-based CIs.
# Comparison of CI size
show_values(
data.frame(Category=CI_clt[,"Category"],
ratio_DeltaCI_clt2survey=CI_clt[,"Delta"] / CI_survey[,"Delta"])
)
which gives:
Category ratio_DeltaCI_clt2survey
1 A 1.067
2 B 0.982
3 C 1.013
4 D 0.994
Since all the ratios are close to 1, we conclude that the CI sizes from the two methods are very similar!
6) Check that the CLT-based implementation of the CI seems to be correct
A convenient check on the CLT-based calculation of the CIs is to run it for the Simple Random Sampling (SRS) case and verify that the results coincide with those given by the svymean() calculation under an SRS design.
# CLT-based calculation
df_noweights = df
df_noweights$Weight = 1 # SRS: weights equal to 1
show_values( ci_clt(df_noweights, ci_level=0.95) )
which gives:
Category p Delta 2.5 % 97.5 %
1 A 0.1 0.0263 0.0737 0.126
2 B 0.2 0.0351 0.1649 0.235
3 C 0.3 0.0402 0.2598 0.340
4 D 0.4 0.0429 0.3571 0.443
which we compare to the survey-based calculation:
# Survey-based calculation
design <- svydesign(ids=~1, probs=NULL, data=df)
show_values( ci_survey(df, design) )
that gives:
p Delta 2.5 % 97.5 %
CategoryA 0.1 0.0263 0.0737 0.126
CategoryB 0.2 0.0351 0.1649 0.235
CategoryC 0.3 0.0402 0.2598 0.340
CategoryD 0.4 0.0430 0.3570 0.443
We see that the results coincide, suggesting that the implementation of the CLT-based calculation of the CI in the general case with unequal weights is correct.
I finally found a very easy solution for my problem. The package "survey" does just what I want:
set.seed(100)
library(survey)
df <- data.frame(
Category = rep(c("A", "B", "C", "D"), times = seq(50, 200, length.out = 4)),
Weight = sample(c(1, 1/2, 1/3, 1/4, 1/5), 500, prob = c(0.1, 0.2, 0.4, 0.2, 0.1), replace = TRUE)
)
d <- svydesign(id = ~1, weights = ~Weight, data = df)
svymean(~Category, d)
The code returns the weighted proportions (the same results as in "ws") and the corresponding standard errors, which makes it easy to calculate the confidence intervals.
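(For completeness, the intervals themselves can be obtained with confint(), the same function used in ci_survey() above, applied to the svymean() result for the design d:)
# 95% confidence intervals for the weighted category proportions
confint(svymean(~Category, d))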

Equivalent of Matlab's 'fit' for Gaussian mixture models in R?

I have some time series data that looks like this:
x <- c(0.5833, 0.95041, 1.722, 3.1928, 3.941, 5.1202, 6.2125, 5.8828,
4.3406, 5.1353, 3.8468, 4.233, 5.8468, 6.1872, 6.1245, 7.6262,
8.6887, 7.7549, 6.9805, 4.3217, 3.0347, 2.4026, 1.9317, 1.7305,
1.665, 1.5655, 1.3758, 1.5472, 1.7839, 1.951, 1.864, 1.6638,
1.5624, 1.4922, 0.9406, 0.84512, 0.48423, 0.3919, 0.30773, 0.29264,
0.19015, 0.13312, 0.25226, 0.29403, 0.23901, 0.000213074755156413,
5.96565965097398e-05, 0.086874, 0.000926808687858284, 0.000904641782399267,
0.000513042259030044, 0.40736, 4.53928073402494e-05, 0.000765719624469057,
0.000717419263673946)
I would like to fit a curve to this data, using mixtures of one to five Gaussians. In Matlab, I could do the following:
fits{1} = fit(1:length(x),x,fittype('gauss1'));
fits{2} = fit(1:length(x),x,fittype('gauss2'));
fits{3} = fit(1:length(x),x,fittype('gauss3'));
... and so on.
In R, I am having difficulty identifying a similar method.
library(mclust)

dat <- data.frame(time = 1:length(x), x = x)
fits <- list()
fits[[1]] <- Mclust(dat, G = 1)
fits[[2]] <- Mclust(dat, G = 2)
fits[[3]] <- Mclust(dat, G = 3)
... but this does not really seem to be doing quite the same thing. For example, I am not sure how to calculate the R^2 between the fit curve and the original data using the Mclust solution.
Is there a simpler alternative in base R to fitting a curve using a mixture of Gaussians?
Function
With the code given below, and with a bit of luck in finding good initial parameters, you should be able to curve-fit Gaussians to your data.
In the function fit_gauss, the aim is to fit y ~ p_model(x, a, b, d); the number of Gaussians used is determined by the length of the initial-value vectors a, b and d, which must all have the same length.
I have demonstrated curve-fitting of the OP's data with up to three Gaussians.
Specifying Initial Values
This is pretty much the most work I have done with nls (thanks to the OP for that), so I am not quite sure what the best method is for selecting the initial values. Naturally, they depend on the heights of the peaks (a) and on the mean and standard deviation of x around them (b and d).
One option would be, for a given number of Gaussians, to try a number of starting values and keep the one with the best fit based on the residual standard error (summary(fit)$sigma).
I fiddled a bit to find initial parameters, but I dare say the parameters and
the plot for the three-Gaussian model look solid.
Fitting one, two and three Gaussians to the example data
ind <- 1:length(x)
# plot original data
plot(ind, x, pch = 21, bg = "blue")
# Gaussian fit
fit_gauss <- function(y, x, a, b, d) {
  # sum of length(a) Gaussian components evaluated at x
  p_model <- function(x, a, b, d) {
    rowSums(sapply(1:length(a),
                   function(i) a[i] * exp(-((x - b[i]) / d[i])^2)))
  }
  fit <- nls(y ~ p_model(x, a, b, d),
             start = list(a = a, b = b, d = d),
             trace = FALSE,
             control = list(warnOnly = TRUE, minFactor = 1/2048))
  fit
}
Single Gaussian
g1 <- fit_gauss(y = x, x = ind, a=1, b = mean(ind), d = sd(ind))
lines(ind, predict(g1), lwd = 2, col = "green")
Two Gaussians
g2 <- fit_gauss(y = x, x = ind, a = c(coef(g1)[1], 1),
                b = c(coef(g1)[2], 30),
                d = c(coef(g1)[3], 2))  # reuse the fitted width from g1 as the starting value
lines(ind, predict(g2), lwd = 2, col = "red")
Three Gaussians
g3 <- fit_gauss(y = x, x = ind, a = c(5, 4, 4),
                b = c(12, 17, 11), d = c(13, 2, 2))
lines(ind, predict(g3), lwd = 2, col = "black")
Summary of the fit with three Gaussians
summary(g3)
# Formula: x ~ p_model(ind, a, b, d)
#
# Parameters:
# Estimate Std. Error t value Pr(>|t|)
# a1 5.9307 0.5588 10.613 5.93e-14 ***
# a2 3.5689 0.7098 5.028 8.00e-06 ***
# a3 -2.2066 0.8901 -2.479 0.016894 *
# b1 12.9545 0.5289 24.495 < 2e-16 ***
# b2 17.4709 0.2708 64.516 < 2e-16 ***
# b3 11.3839 0.3116 36.538 < 2e-16 ***
# d1 11.4351 0.8568 13.347 < 2e-16 ***
# d2 1.8893 0.4897 3.858 0.000355 ***
# d3 1.0848 0.6309 1.719 0.092285 .
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 0.7476 on 46 degrees of freedom
#
# Number of iterations to convergence: 34
# Achieved convergence tolerance: 8.116e-06
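(The original question asked how to compute R² between the fitted curve and the data; a minimal sketch based on the nls fit g3 above, not part of the original answer, would be:)
# pseudo R-squared for the three-Gaussian fit: 1 - SS_residual / SS_total
rss <- sum(residuals(g3)^2)
tss <- sum((x - mean(x))^2)
1 - rss / tss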

samplesize package in R, understanding the parameters

Small Disclaimer: I considered posting this on cross-validated, but I feel that this is more related to a software implementation. The question can be migrated if you disagree.
I am trying out the package samplesize. I am trying to decipher what the k parameter for the function n.ttest is. The following is stated in the documentation:
k Sample fraction k
This is not very helpful. What exactly is this parameter?
I am performing the following calculations; all the essential values are in the vals variable, which I provide below:
power <- 0.90
alpha <- 0.05
vals <- ??? # These values are provided below
mean.diff <- vals[1,2]-vals[2,2]
sd1 <- vals[1,3]
sd2 <- vals[2,3]
k <- vals[2,4]/(vals[1,4]+vals[2,4])
design <- "unpaired"
fraction <- "unbalanced"
variance <- "equal"
# Get the sample size
n.ttest(power = power, alpha = alpha, mean.diff = mean.diff,
sd1 = sd1, sd2 = sd2, k = k, design = design,
fraction = fraction, variance = variance)
vals contains the following values:
> vals
affected mean sd length
1 1 -0.8007305 7.887657 57
2 2 4.5799913 6.740781 16
Is k the proportion of one group, in the total number of observations? Or is it something else? If I am correct, then does the proportion correspond to group with sd1 or sd2?
Your first instinct was right -- this belongs on stats.SE rather than on SO. The parameter k has a statistical interpretation which can be found in any reference on power analysis. It essentially sets the size of the second sample when, as in two-sample tests, the second sample is constrained to be a certain fraction of the first.
You can see the relevant lines of the code here (lines 106 to 120 of n.ttest):
unbalanced = {
    df <- n.start - 2
    c <- (mean.diff/sd1) * (sqrt(k)/(1 + k))
    tkrit.alpha <- qt(conf.level, df = df)
    tkrit.beta <- qt(power, df = df)
    n.temp <- ((tkrit.alpha + tkrit.beta)^2)/(c^2)
    while (n.start <= n.temp) {
        n.start <- n.start + 1
        tkrit.alpha <- qt(conf.level, df = n.start - 2)
        tkrit.beta <- qt(power, df = n.start - 2)
        n.temp <- ((tkrit.alpha + tkrit.beta)^2)/(c^2)
    }
    n1 <- n.start/(1 + k)
    n2 <- k * n1
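(From the last two lines of that excerpt, n2 = k * n1, so k is the ratio of the second group's size to the first. A tiny illustration with a made-up total of n.start = 100 and the k implied by your group sizes:)
# illustration only: how a total of 100 observations is split when k = 16/57
k <- 16/57
n.start <- 100
n1 <- n.start / (1 + k)     # size of group 1 (about 78.1)
n2 <- k * n1                # size of group 2 (about 21.9)
c(n1 = n1, n2 = n2, fraction = n2 / n1)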
In your case:
library(samplesize)
vals = data.frame(
affected = c(1, 2),
mean = c(-0.8007305, 4.5799913),
sd = c(7.887657, 6.740781),
length = c(57, 16))
power <- 0.90
alpha <- 0.05
mean.diff <- vals[1,2]-vals[2,2]
sd1 <- vals[1,3]
sd2 <- vals[2,3]
k <- vals[2,4]/(vals[1,4]+vals[2,4])   # proportion of group 2 in the total
k <- vals[2,4]/vals[1,4]               # overrides the line above: k is the ratio n2/n1 = 16/57
design <- "unpaired"
fraction <- "unbalanced"
variance <- "equal"
# Get the sample size
tt1 = n.ttest(power = power,
alpha = alpha,
mean.diff = mean.diff,
sd1 = sd1,
sd2 = sd2,
k = k,
design = design,
fraction = fraction,
variance = variance)
You can see that:
assertthat::are_equal(ceiling(tt1$`Sample size group 1`*tt1$Fraction),
tt1$`Sample size group 2`)
