Finding Sample Size - r

I am attempting to use several methods (Wald, Wilson, Clopper-Pearson, Jeffreys, etc.) to calculate sample sizes for confidence intervals. I have been unable to find, in R, how to calculate these. Is there a better way to calculate these besides brute force? Does R have a package that will output all to compare?
I have been unsuccessful with the likes of n.clopper.pearson{GenBinomApps} and some of these require lots of by-hand computations. I have done this for the Wald method:
#Variables
z <- 1.95996
d <- .05
p <- 0.5
q <- 1 - p
#Wald
n_wald <- (z^2 * (p*q))/(d^2)
n_wald
But I have not been able to find a way, besides guess-and-check methods, to produce the others in R.

I was able to answer my own question with help from the comments:
n_wald <- ciss.wald(p, d, alpha = 0.05)
n_wilson <- ciss.wilson(p, d, alpha = 0.05)
n_agricoull <- ciss.agresticoull(p, d, alpha = 0.05)
These functions are from the binomSamSize package, and they calculated the sample sizes easily. I am still struggling with an optimization for the Clopper-Pearson and Jeffreys intervals, if anyone can provide direction there.
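For the two remaining intervals, one option is a direct search in base R. The sketch below is an assumption-laden illustration, not a method from the thread: the helper names `ci_half_width` and `n_for_half_width` are hypothetical, the interval is evaluated at the expected count `x = round(n * p)`, and the search returns the first `n` whose half-width is at most `d`. Because of the rounding, the half-width is not strictly monotone in `n`, so treat the result as approximate.

```r
# Half-width of a Clopper-Pearson or Jeffreys interval at the expected count.
# Both intervals are beta quantiles, so only base R's qbeta() is needed.
ci_half_width <- function(n, p, alpha = 0.05,
                          method = c("clopper-pearson", "jeffreys")) {
  method <- match.arg(method)
  x <- round(n * p)  # expected number of successes (an assumption of this sketch)
  if (method == "clopper-pearson") {
    lower <- if (x == 0) 0 else qbeta(alpha / 2, x, n - x + 1)
    upper <- if (x == n) 1 else qbeta(1 - alpha / 2, x + 1, n - x)
  } else {
    lower <- if (x == 0) 0 else qbeta(alpha / 2, x + 0.5, n - x + 0.5)
    upper <- if (x == n) 1 else qbeta(1 - alpha / 2, x + 0.5, n - x + 0.5)
  }
  (upper - lower) / 2
}

# Linear scan for the first n whose half-width is at most d.
# Returns NA if no such n is found up to n_max.
n_for_half_width <- function(p, d, alpha = 0.05,
                             method = "clopper-pearson", n_max = 1e5) {
  for (n in 2:n_max) {
    if (ci_half_width(n, p, alpha, method) <= d) return(n)
  }
  NA_integer_
}

n_cp       <- n_for_half_width(0.5, 0.05, method = "clopper-pearson")
n_jeffreys <- n_for_half_width(0.5, 0.05, method = "jeffreys")
```

The Jeffreys interval always lies inside the Clopper-Pearson interval for the same data, so `n_jeffreys` should never exceed `n_cp`.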

Related

How to use an equation with my data in R?

I am struggling with a portion of the data analysis for some research I have carried out. Other researchers have used an equation to estimate population growth rate that I would like to implement, but I am hitting a wall with trying to do so. Below is the equation:
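The equation image did not survive in this copy of the post. Reading it back from the variable definitions and the pseudocode in the answers, it appears to have the following form. This is a reconstruction and should be checked against the original source:

```latex
r' = \frac{\ln\!\left[\dfrac{1}{N_0}\displaystyle\sum_x A_x\, f(w_x)\right]}
          {D + \dfrac{\displaystyle\sum_x x\, A_x\, f(w_x)}{\displaystyle\sum_x A_x\, f(w_x)}}
```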
Where N0 is the initial number of females in a cohort, Ax is the number of females emerging on day x, wx is a measure of mean female size on day x per replicate, f(wx) is a function relating fecundity to female size, and D is the time (in days) for a female to reproduce.
N0 (n=15) and D (7) are fixed numbers that I can put in the equation. f(wx) is a function that I have (y = 91.85x - 181.40). Below is a small sample of my data:
df <- data.frame(replicate = c('1','1','2','2','3','3','4','4'),
size = c(5.1, 4.9, 4.7, 4.6, 5.1,2.4,4.3,4.4),
day_emerging = c('6','7','6','7','6','8','7','6'))
I am sorry if this is a bad question for this site. I am just lost for how to handle this. I need R to be able to do the equation for different days. I'm not sure if that is actually possible with my current data format, because R will have to figure out how many females emerged on day x and then perform the other calculations for that day. So maybe this is impossible.
Thank you very much for any advice you can offer.
Here is a base R solution. Hope this is what you are after
dfs <- split(df,df$day_emerging)
p <- sum(sapply(dfs, function(v) nrow(v)*f(mean(v$size))))
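# as.character() guards against day_emerging being a factor (the data.frame default in R < 4.0),
# where as.numeric() alone would return the level codes 1, 2, 3 instead of the days 6, 7, 8
q <- sum(sapply(dfs, function(v) nrow(v)*as.numeric(as.character(unique(v$day_emerging)))*f(mean(v$size))))
res <- log(p/n)/(D + q/p)
such that
> res
[1] 0.356526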
DATA
n <- 15
D <- 7
f <- function(x) 91.85*x-181.4
df <- data.frame(replicate = c('1','1','2','2','3','3','4','4'),
size = c(5.1, 4.9, 4.7, 4.6, 5.1,2.4,4.3,4.4),
day_emerging = c('6','7','6','7','6','8','7','6'))
The answer to this is not particularly R-specific, but rather a skill in and of itself. What you want to do is translate a formal mathematical language into one that works in R (or Python, or Matlab, etc.).
This is a skill that's worth developing. In Python-like pseudocode:
numerator = math.log((1 / n_0) * sum(A * f(w)))
denominator = D + (sum(X * A * f(w)) / sum(A * f(w)))
r_prime = numerator / denominator
As you can see, there are a lot of unknown variables that you'll have to set beforehand. Also, things like f(w) will need to be defined as helper functions earlier in the script so they can be used. In general, you just want to be able to break down your equation into small parts that you can verify are correct.
It very much helps to do some unit testing with these things - package the equation as a function (or set of small functions that you'll use together) and feed it data that you've run through the equation and verified in another way - by hand, or by a more familiar package. This way, you only have to worry about expressing it in the correct syntax and will know when you've gotten everything correct.
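As a minimal sketch of that idea in R: the function name `r_prime` and its argument layout (one entry per emergence day) are hypothetical choices for illustration, not something the original posts define. The check feeds it a case that can be worked by hand: with one female emerging on day 3, `f` as the identity, and a size of `15 * exp(1)`, the log term is exactly 1 and the denominator is 7 + 3, so the result should be 0.1.

```r
# Hypothetical helper: growth-rate index from per-day summaries.
#   A  = number of females emerging on each day
#   w  = mean female size on each day
#   x  = the day numbers themselves
#   N0 = initial cohort size, D = days to reproduce, f = fecundity function
r_prime <- function(A, w, x, N0, D, f) {
  stopifnot(length(A) == length(w), length(A) == length(x))
  offspring <- A * f(w)  # expected offspring contributed by each day
  log(sum(offspring) / N0) / (D + sum(x * offspring) / sum(offspring))
}

# Hand-verified case: log(15 * e / 15) = 1 and D + x = 10, so r' = 0.1
check <- r_prime(A = 1, w = 15 * exp(1), x = 3, N0 = 15, D = 7, f = identity)
stopifnot(isTRUE(all.equal(check, 0.1)))
```

Once the function passes a few such checks, it can be applied to the per-day counts and mean sizes computed from the real data.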

Problem with numerical integration and optim R

I am trying to optimize a tricky density, which involves combining integrate and optim in R to estimate some parameters of interest. My question is more about code than statistics, which is why I am posting here.
I searched online and didn't find anything, so I made a few attempts that I am not confident in. I would like to estimate my beta parameters without the influence of b.
I have had several different errors from integrate or optim.
Here is an example of what I'm trying to do.
X <- matrix(c(1,1,1,1,1,56,54,32,12,9), nrow=5, ncol=2)
y <- matrix(c(0,1,1,1,0), nrow=5, ncol=1)
f <- function(beta){
g <- function(X,y,b){
(1/(1 + exp(-(X%*%beta + b))))^y - (1-(1/(1 + exp(-(X%*%beta + b)))))^(1-y)
}
integrate(Vectorize(g), lower = 0, upper = Inf,X=X, y=y)
}
optim(par=c(1,0), f, method="BFGS", hessian=TRUE)
I would like an estimate for my beta parameters using optim.
I have been working on this for a week and I'm really struggling to get estimates for my two parameters, beta0 and beta1.
Other approaches to this estimation, such as the EM algorithm or Gauss-Hermite quadrature, are welcome.
Thanks for any help.
Loïc.

Data perturbation - How to perform it?

I am doing some projects on statistical simulation in R, based on "Introduction to Scientific Programming and Simulation Using R". In the student projects section (chapter 24) I am working on the "The pipe spiders of Brunswick" problem, but I am stuck on one part of an evolutionary algorithm, where you need to perform some data perturbation according to the sentence below:
"With probability 0.5 each element of the vector is perturbed, independently
of the others, by an amount normally distributed with mean 0 and standard
deviation 0.1"
What does being "perturbed" really mean here? I don't know which operation I should apply to my vector to make this perturbation happen, and I'm not finding any answers to this problem.
Thanks in advance!
# using the most important features, we create a ML model:
m1 <- lm(PREDICTED_VALUE ~ PREDICTER_1 + PREDICTER_2 + PREDICTER_N )
#summary(m1)
#anova(m1)
# after creating the model, we perturb as follows:
#install.packages("perturb") #install the package
library(perturb)
set.seed(1234) # for same results each time you run the code
# you can change niter to any value n; the total number of iterations comes to n+1
p1_new <- perturb(m1, pvars=c("PREDICTER_1","PREDICTER_N"), prange = c(1,1), niter=200)
p1_new # check the values of p1
summary(p1_new)
Perturbing just means adding a small, noisy shift to a number. Your code might look something like this.
x = sample(10, 10)
ind = rbinom(length(x), 1, 0.5) == 1
x[ind] = x[ind] + rnorm(sum(ind), 0, 0.1)
rbinom gets the elements to be modified with probability 0.5 and rnorm adds the perturbation.

Does Cattell's profile similarity coefficient (Rp) exist as a function in R?

I'm comparing different measures of distance and similarity for vector profiles (subtest results) in R. Most of them are easy to compute and/or exist in dist().
Unfortunately, one that might be interesting, and that is too difficult for me to calculate myself, is Cattell's Rp. I cannot find it in R.
Does anybody know if this already exists?
Or can you help me write a function?
The formula (Cattell 1994) of Rp is this:
(2k-d^2)/(2k + d^2)
where:
k is the median for chi square on a sample of size n;
d is the sum of the (weighted=m) difference between the two profiles,
something like: sum(m(x(i)-y(i)));
One thing I don't know is how to get the chi-square median in there.
Thank you
What I get without defining k is:
Rp.Cattell <- function(x,y){z <- (2k-(sum(x-y))^2)/(2k+(sum(x-y))^2);return(z)}
Vector examples are:
x <- c(-1.2357,-1.1999,-1.4727,-0.3915,-0.2547,-0.4758)
y <- c(0.7785,0.9357,0.7165,-0.6067,-0.4668,-0.5925)
They are measured by the same device, but relate to different body parts. They don't need to be standardized or weighted, I would say.
This page gives a general formula for k, and then gives a more thorough method using SAS/IML which pretty much gives the same results. So I used the general formula, added calculation of degrees of freedom, which leads to this:
Rp.Cattell <- function(x,y) {
dof <- (2-1) * (length(y)-1)
k <- (1-2/(9*dof))^3
z <- (2*k-sum(sum(x-y))^2)/(2*k+sum(sum(x-y))^2)
return(z)
}
x <- c(-1.2357,-1.1999,-1.4727,-0.3915,-0.2547,-0.4758)
y <- c(0.7785,0.9357,0.7165,-0.6067,-0.4668,-0.5925)
Rp.Cattell(x, y)
# [1] -0.9012083
Does this figure appear to make sense?
Trying to verify the function, I have now found that the chi-square median is simply the chi-square quantile at probability 0.5. So the function should be:
Rp.Cattell <- function(x,y){
dof <- (2-1) * (length(y)-1)
k <- qchisq(.50, df=dof)
z <- (2k-(sum(x-y))^2)/(2k+(sum(x-y))^2);
return(z)}
It is necessary, though, to standardize the values beforehand, so that the results are distributed correctly.
So:
library ("stringr")
# they are centered already
x <- as.vector(scale(c(-1.2357,-1.1999,-1.4727,-0.3915,-0.2547,-0.4758),center=F, scale=T))
y <- as.vector(scale(c(0.7785,0.9357,0.7165,-0.6067,-0.4668,-0.5925),center=F, scale=T))
Rp.Cattell(x, y)
# [1] -0.584423
This sounds reasonable now - or not?
I think the calculation of z is incorrect.
You need the sum of the squared differences, not the square of the sum of the differences. Also, the multiplication operator is missing in 2k.
It should be
z <- (2*k-sum((x-y)^2))/(2*k+sum((x-y)^2))
Do you agree?
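Putting the pieces from this thread together, a corrected function might look like the sketch below. The name `rp_cattell` is hypothetical, and the default degrees of freedom (`length(x) - 1`) simply follow the choice made earlier in the thread, which is an assumption rather than a confirmed reading of Cattell's definition. It is exposed as an argument so it can be changed.

```r
# Sketch of Cattell's Rp with the chi-square median as k and the
# sum of squared differences as d^2.
rp_cattell <- function(x, y, df = length(x) - 1) {
  stopifnot(length(x) == length(y))
  k  <- qchisq(0.5, df = df)   # median of the chi-square distribution
  d2 <- sum((x - y)^2)         # sum of squared differences, not square of the sum
  (2 * k - d2) / (2 * k + d2)
}

x <- c(-1.2357, -1.1999, -1.4727, -0.3915, -0.2547, -0.4758)
y <- c(0.7785, 0.9357, 0.7165, -0.6067, -0.4668, -0.5925)
rp <- rp_cattell(x, y)
```

With this form, identical profiles give exactly 1, and the value falls toward -1 as the profiles move apart.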

How extreme values of a functional can be found using R?

I have a functional like this :
(LaTex formula: $v[y]=\int_0^2 (y'^2+23yy'+12y^2+3ye^{2t})dt$)
with given start and end conditions y(0)=-1, y(2)=18.
How can I find extreme values of this functional in R? I realize how it can be done for example in Excel but didn't find appropriate solution in R.
Before trying to solve such a task in a numerical setting, it might be better to lean back and think about it for a moment.
This is a problem typically treated in the mathematical discipline of "variational calculus". A necessary condition for a function y(t) to be an extremum of the functional (i.e., the integral) is the so-called Euler-Lagrange equation, see
Calculus of Variations at Wolfram Mathworld.
Applying it to f(t, y, y') as the integrand in your request, I get (please check, I can easily have made a mistake)
y'' - 12*y - 3/2*exp(2*t) = 0
You can go now and find a symbolic solution for this differential equation (with the help of a textbook, or some CAS), or solve it numerically with the help of an R package such as 'deSolve'.
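As a check on that derivation, here is the Euler-Lagrange computation worked term by term, with particular attention to the sign of the exponential term. The symbol F for the integrand is introduced here only for illustration:

```latex
\begin{aligned}
F(t, y, y') &= y'^2 + 23\,y\,y' + 12\,y^2 + 3\,y\,e^{2t} \\
\frac{\partial F}{\partial y} &= 23\,y' + 24\,y + 3\,e^{2t} \\
\frac{d}{dt}\frac{\partial F}{\partial y'} &= \frac{d}{dt}\left(2\,y' + 23\,y\right) = 2\,y'' + 23\,y' \\
\frac{\partial F}{\partial y} - \frac{d}{dt}\frac{\partial F}{\partial y'} = 0
  &\;\Longrightarrow\; y'' - 12\,y - \tfrac{3}{2}\,e^{2t} = 0
\end{aligned}
```

The 23 y y' term drops out because it is a total derivative, so it contributes only a constant fixed by the boundary conditions.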
PS: Solving this as an optimization problem based on discretization is possible, but may lead you on a long and stony road. I remember solving the "brachistochrone problem" to a satisfactory accuracy only by applying several hundred variables (not in R).
Here is a numerical solution in R. First the functional:
f<-function(y,t=head(seq(0,2,len=length(y)),-1)){
len<-length(y)-1
dy<-diff(y)*len/2
y0<-(head(y,-1)+y[-1])/2
2*sum(dy^2+23*y0*dy+12*y0^2+3*y0*exp(2*t))/len
}
Now the function that does the actual optimization. The best results I got were using the BFGS optimization method, and parametrizing using dy rather than y:
findMinY<-function(points=100, ## number of points of evaluation
boundary=c(-1,18), ## boundary values
y0=NULL, ## optional initial value
method="Nelder-Mead", ## optimization method
dff=T) ## if TRUE, optimizes based on dy rather than y
{
t<-head(seq(0,2,len=points),-1)
if(is.null(y0) || length(y0)!=points)
y0<-seq(boundary[1],boundary[2],len=points)
if(dff)
y0<-diff(y0)
else
y0<-y0[-1]
y0<-head(y0,-1)
ff<-function(z){
if(dff)
y<-c(cumsum(c(boundary[1],z)),boundary[2])
else
y<-c(boundary[1],z,boundary[2])
f(y,t)
}
res<-optim(y0,ff,control=list(maxit=1e9),method=method)
cat("Iterations:",res$counts,"\n")
ymin<-res$par
if(dff)
c(cumsum(c(boundary[1],ymin)),boundary[2])
else
c(boundary[1],ymin,boundary[2])
}
With 500 points of evaluation, it only takes a few seconds with BFGS:
> system.time(yy<-findMinY(500,method="BFGS"))
Iterations: 90 18
user system elapsed
2.696 0.000 2.703
The resulting function looks like this:
plot(seq(0,2,len=length(yy)),yy,type='l')
And now a solution that numerically integrates the Euler equation.
As @HansWerner pointed out, this problem boils down to applying the Euler-Lagrange equation to the integrand in OP's question, and then solving that differential equation, either analytically or numerically. In this case the relevant ODE is
y'' - 12*y = 3/2*exp(2*t)
subject to:
y(0) = -1
y(2) = 18
So this is a boundary value problem, best approached using bvpcol(...) in package bvpSolve.
library(bvpSolve)
F <- function(t, y.in, pars){
dy <- y.in[2]
d2y <- 12*y.in[1] + 1.5*exp(2*t)
return(list(c(dy,d2y)))
}
init <- c(-1,NA)
end <- c(18,NA)
t <- seq(0, 2, by = 0.01)
sol <- bvpcol(yini = init, yend = end, x = t, func = F)
y = function(t){ # analytic solution...
b <- sqrt(12)
a <- 1.5/(4-b*b)
u <- exp(2*b)
C1 <- ((18*u + 1) - a*(exp(4)*u-1))/(u*u - 1)
C2 <- -1 - a - C1
return(a*exp(2*t) + C1*exp(b*t) + C2*exp(-b*t))
}
par(mfrow=c(1,2))
plot(t,y(t), type="l", xlim=c(0,2),ylim=c(-1,18), col="red", main="Analytical Solution")
plot(sol[,1],sol[,2], type="l", xlim=c(0,2),ylim=c(-1,18), xlab="t", ylab="y(t)", main="Numerical Solution")
It turns out that in this very simple example, there is an analytical solution:
y(t) = a * exp(2*t) + C1 * exp(sqrt(12)*t) + C2 * exp(-sqrt(12)*t)
where a = -3/16 and C1 and C2 are determined to satisfy the boundary conditions. As the plots show, the numerical and analytic solutions agree completely, and also agree with the solution provided by @mrip.
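A quick way to gain confidence in the closed form without plotting is to check the boundary conditions and the ODE residual directly. The sketch below uses base R only. The step size `h` and the grid `tt` are arbitrary choices made for illustration, and the second derivative is approximated by central differences rather than computed symbolically.

```r
# Analytic solution with the same constants as above
b  <- sqrt(12)
a  <- 1.5 / (4 - b^2)   # particular-solution coefficient, equals -3/16
u  <- exp(2 * b)
C1 <- ((18 * u + 1) - a * (exp(4) * u - 1)) / (u^2 - 1)
C2 <- -1 - a - C1
y  <- function(t) a * exp(2 * t) + C1 * exp(b * t) + C2 * exp(-b * t)

# Boundary conditions: both entries should be close to zero
bc_error <- c(y(0) - (-1), y(2) - 18)

# ODE residual y'' - 12*y - 1.5*exp(2*t), with y'' from central differences
h   <- 1e-3
tt  <- seq(0.1, 1.9, by = 0.1)
d2y <- (y(tt + h) - 2 * y(tt) + y(tt - h)) / h^2
max_residual <- max(abs(d2y - 12 * y(tt) - 1.5 * exp(2 * tt)))
```

A small `max_residual`, limited by the finite-difference error, together with near-zero `bc_error`, indicates that the formula satisfies the boundary value problem.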
