I need help calculating bootstrap-based credible intervals for the quantity qtt.ci, which is derived from my regression coefficients coef.def.
So far my attempts have resulted in:
Error in quantile.default(s, c(0.025, 0.25, 0.5, 0.75, 0.975)) :
missing values and NaN's not allowed if 'na.rm' is FALSE
preceded by:
Warning message: In bayesboot(dat, boot_fn) : The sample from
bayesboot contains either NAs, NaNs or NULLs. Make sure that your
statistic function only return actual values.
Here are my sample data:
dat <- data.frame(
  A = c(1, 1, 0, 0), B = c(1, 0, 1, 0),
  Pass = c(278, 100, 153, 79), Fail = c(743, 581, 1232, 1731)
)
Below is my regression. The quantity for which I want bootstrap-based 95% credible intervals is qtt.ci:
boot_fn <- function(dat) {
  coef.def <- unname(coef(glm(cbind(Pass, Fail) ~ A * B, binomial, dat)))
}
qtt.ci <- exp(sum(coef.def[2:4])) - exp(coef.def[2]) - exp(coef.def[3]) + 1
Here is my attempt:
bb_ci <- bayesboot(dat, boot_fn)
summary(bb_ci)
I am not certain how to get the bootstrap-based confidence intervals for qtt.ci.
Thank you in advance.
EDIT:
Following the answer by @RuiBarradas, I tried bootstrapping to get the 95% CI for the quantity qtt.ci, but without success:
library(bayesboot)
boot_fn <- function(dat) {
  coef.def <- unname(coef(glm(cbind(Pass, Fail) ~ A * B, binomial, dat)))
  qtt <- exp(sum(coef.def[2:4])) - exp(coef.def[2]) - exp(coef.def[3]) + 1
  if (all(!is.na(qtt))) qtt else NULL
}
Runs <- 1e2
qtt.ci <- bayesboot(dat, boot_fn, R = Runs, R2 = Runs)
summary(qtt.ci)
Quantiles:
statistic q2.5% q25% median q75% q97.5%
V1 2.705878 2.705878 2.705878 2.705878 2.705878
Therefore, this does not give the CI for qtt.ci. The output is simply the point estimate for qtt:
qtt <- exp(sum(coef.def[2:4])) - exp(coef.def[2]) - exp(coef.def[3]) + 1
qtt
[1] 2.705878
Any help would be much appreciated.
The following solves the warning issue. I have tested it with far fewer runs: 100 instead of 4000.
library(bayesboot)
boot_fn <- function(dat) {
  fit <- glm(cbind(Pass, Fail) ~ A * B, binomial, dat)
  coef.def <- unname(coef(fit))
  if (all(!is.na(coef.def))) coef.def else NULL
}
Runs <- 1e2
bb_ci <- bayesboot(dat, boot_fn, R = Runs, R2 = Runs)
summary(bb_ci)
Edit.
According to the formula in the question and the dialogue in comments with the OP, to get the per-replicate values of qtt (from which the bootstrap-based CI follows) run:
qtt <- exp(rowSums(bb_ci[, 2:4])) - exp(bb_ci[[2]]) - exp(bb_ci[[3]]) + 1
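The 95% interval is then just the empirical quantiles of those per-replicate values (a minimal sketch, assuming bb_ci holds the four coefficient columns returned by boot_fn above):
quantile(qtt, probs = c(0.025, 0.975))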
I want to estimate means and totals from a stratified sampling design in which single stage cluster sampling was used in each stratum. I believe I have the design properly specified using the svydesign() function of the survey package. But I'm not sure how to correctly specify the stratum weights.
Example code is shown below. I provide unadjusted stratum weights using the weights= argument. I expected that the estimate and the SE from svytotal() would be equal to the sum of the stratum weights (70, in the example) times the estimate and SE from svymean(). Instead the estimates differ by a factor of 530 (which is the sum of the stratum weights over all of the elements in the counts data) and the SEs differ by a factor of 898 (???). My questions are (1) how can I provide my 3 stratum weights to svydesign() in a way that it understands, and (2) why aren't the estimates and SEs from svytotal() and svymean() differing by the same factor?
library(survey)
# example data from a stratified sampling design in which
# single stage cluster sampling is used in each stratum
counts <- data.frame(
Stratum=rep(c("A", "B", "C"), c(5, 8, 8)),
Cluster=rep(1:8, c(3, 2, 3, 2, 3, 2, 3, 3)),
Element=c(1, 2, 3, 1, 2, 1, 2, 3, 1, 2, 1, 2, 3, 1, 2, 1, 2, 3, 1, 2, 3),
Count = 1:21
)
# stratum weights
weights <- data.frame(
Stratum=c("A", "B", "C"),
W=c(10, 20, 40)
)
# combine counts and weights
both <- merge(counts, weights)
# estimate mean and total count
D <- svydesign(id=~Cluster, strata=~Stratum, weights=~W, data=both)
a <- svymean(~Count, D)
b <- svytotal(~Count, D)
sum(weights$W) # 70
sum(both$W) # 530
coef(b)/coef(a) # 530
SE(b)/SE(a) # 898.4308
First update
I'm adding a diagram to help explain my design. The entire population is a lake with known area (70 ha in this example). The strata have known areas, too (10, 20, and 40 ha). The number of clusters allocated to each stratum was not proportional. Also, the clusters are tiny relative to the number that could possibly be sampled, so the finite population correction is FPC = 1.
I want to calculate an overall mean and SE on a per unit area basis and a total that is equal to 70 times this mean and SE.
Second update
I wrote code to do the calculations from scratch. I get a total estimate of 920 with an SE of 61.6.
library(survey)
library(tidyverse)
# example data from a stratified sampling design in which
# single stage cluster sampling is used in each stratum
counts <- data.frame(
Stratum=rep(c("A", "B", "C"), c(5, 8, 8)),
Cluster=rep(1:8, c(3, 2, 3, 2, 3, 2, 3, 3)),
Element=c(1, 2, 3, 1, 2, 1, 2, 3, 1, 2, 1, 2, 3, 1, 2, 1, 2, 3, 1, 2, 3),
Count = c(5:1, 6:21)
)
# stratum weights
areas <- data.frame(
Stratum=c("A", "B", "C"),
A_h=c(10, 20, 40)
)
# calculate cluster means
step1 <- counts %>%
  group_by(Stratum, Cluster) %>%
  summarise(P_hi = sum(Count), m_hi = n())

step2 <- step1 %>%
  group_by(Stratum) %>%
  summarise(
    ybar_h = sum(P_hi) / sum(m_hi),
    n_h = n(),
    sh.numerator = sum((P_hi - ybar_h * m_hi)^2),
    mbar_h = mean(m_hi)
  ) %>%
  mutate(
    S_ybar_h = 1 / mbar_h * sqrt(sh.numerator / (n_h * (n_h - 1)))
  )

# now expand up to strata
step3 <- step2 %>%
  left_join(areas) %>%
  mutate(
    W_h = A_h / sum(A_h)
  ) %>%
  summarise(
    A = sum(A_h),
    ybar_strat = sum(W_h * ybar_h),
    S_ybar_strat = sum(W_h * S_ybar_h / sqrt(n_h))
  ) %>%
  mutate(
    tot = A * ybar_strat,
    S_tot = A * S_ybar_strat
  )
step2
step3
This gives the following output:
> step2
# A tibble: 3 x 6
Stratum ybar_h n_h sh.numerator mbar_h S_ybar_h
<fctr> <dbl> <int> <dbl> <dbl> <dbl>
1 A 3.0 2 18.0 2.500000 1.200000
2 B 9.5 3 112.5 2.666667 1.623798
3 C 17.5 3 94.5 2.666667 1.488235
> step3
# A tibble: 1 x 5
A ybar_strat S_ybar_strat tot S_tot
<dbl> <dbl> <dbl> <dbl> <dbl>
1 70 13.14286 0.8800657 920 61.6046
(Revised answer to revised question)
In this case svytotal isn't what you want -- it's for the actual population total of the elements being sampled, and so doesn't make sense when the population is thought of as infinitely bigger than the sample. The whole survey package is really designed for discrete, finite populations, but we can work around it.
I think you want to estimate a mean for each stratum and then combine the means using the stratum weights. To do that:
D <- svydesign(id=~Cluster, strata=~Stratum, data=both)
means <- svyby(~Count, ~Stratum, svymean, design = D)
svycontrast(means, quote(10*A+20*B+40*C))
You'll get a warning
Warning message:
In vcov.svyby(stat) : Only diagonal elements of vcov() available
That's because svyby doesn't return covariances between the stratum means. It's harmless, because the strata really are independent samples (that's what stratification means) so the covariances are zero.
svytotal is doing what I think it should do here: weights are based on sampling probability, so they are only defined for sampling units. The svydesign call applied those weights to the clusters and (because of the cluster sampling) to the elements, giving the 530-fold higher total. You need to supply either observation weights or enough information for svydesign to calculate them itself. If this is cluster sampling with no subsampling, you can divide the stratum weight over the clusters to get the cluster weight, and then divide that over the elements within a cluster to get the observation weight (as sketched below). Or, if the stratum weight is the number of clusters in the population, you can use the fpc argument to svydesign.
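A rough sketch of the first option (my own illustration, assuming the both data frame from the question and single-stage cluster sampling with no subsampling):
library(dplyr)
both_w <- both %>%
  group_by(Stratum) %>%
  mutate(n_clus = n_distinct(Cluster)) %>%   # sampled clusters per stratum
  group_by(Stratum, Cluster) %>%
  mutate(obs_w = W / n_clus / n()) %>%       # stratum weight -> cluster -> element
  ungroup()
D_w <- svydesign(id = ~Cluster, strata = ~Stratum, weights = ~obs_w, data = both_w)
svytotal(~Count, D_w)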
The fact that the SE doesn't scale the same way as the point estimate is because the population size is unknown and has to be estimated. The mean is the estimated total divided by the estimated population size, and the SE estimate takes account of the variance of the denominator and its covariance with the numerator.
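As a small numerical illustration with the original design from the question (using only numbers already shown above):
# the mean is the estimated total divided by the sum of the weights (530 here)
coef(b) / sum(weights(D))   # reproduces coef(a)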
I want to fit a multivariate normal distribution to my data and compute probabilities from it. Here's what I tried, making use of the mvtnorm package.
Sample Dataset
library(mvtnorm)
set.seed(2357)
df <- data.frame(
x = rnorm(1000, mean=80, sd=20),
y = rnorm(1000, mean=0, sd=5),
z = rnorm(1000, mean=0, sd=5)
)
head(df)
x y z
1 70.38 1.307 0.2005
2 59.76 5.781 -3.5095
3 54.14 -1.313 -1.9022
4 79.91 7.754 -6.2076
5 87.07 1.389 1.1065
6 75.89 1.684 6.2979
Fit a multivariate normal distribution and check P(x <= 80) ≈ 0.5
# Get the dimension means and correlation matrix
means <- c(x=mean(df$x), y=mean(df$y), z=mean(df$z))
corr <- cor(df)
# Check P(x <= 80)
sum(df$x <= 80)/nrow(df) # 0.498
pmvnorm(lower=-Inf, upper=c(80, Inf, Inf), mean=means, corr=corr) # 0.8232
Why is the fitted result 0.82? Where did I go wrong?
First, you don't need to simulate anything to study the pmvnorm function:
pmvnorm(lower=rep(-Inf, 3), upper=c(80, Inf, Inf), mean=c(80,0,0), corr=diag(rep(1,3)))
The result is 0.5, as you expected.
Your means vector is approximately (79, 0, 0), so let's try it:
pmvnorm(lower=rep(-Inf, 3), upper=c(80, Inf, Inf), mean=c(79,0,0), corr=diag(rep(1,3)))
The result now is 0.8413447. There's nothing the matter. By specifying only the correlation matrix, you told the software to assume that all variances were unity. In your simulation, the variances were 400, 25, and 25: very different from what you specified in the arguments!
The correct calculation uses the covariance matrix of the data, not its correlation matrix:
pmvnorm(lower=rep(-Inf, 3), upper=c(80, Inf, Inf), mean=means, sigma=cov(df))
The result is 0.5178412, quite in keeping with the data.
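Equivalently, as a cross-check of the same point (a small sketch that just standardizes the bound on x so the correlation matrix is on the right scale):
# P(x <= 80) = P((x - mean(x))/sd(x) <= (80 - mean(x))/sd(x)); with corr= and the
# default zero mean, pmvnorm works on the standardized scale
pmvnorm(lower = rep(-Inf, 3),
        upper = c((80 - mean(df$x)) / sd(df$x), Inf, Inf),
        corr = corr)
This should agree with the covariance-based result of about 0.518.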
My aim is to determine the distance between the injection site of a treatment and the target of this treatment that gives a 0.95 probability of success.
The outcome variable is binary (success: 1 / failure: 0).
I used the Dixon up-and-down methodology, with six distances tested: 0, 2, 4, 6, 8 and 10 mm.
Here are my data:
column 1: distances used
column 2: number of successes
column 3: total number of patients
data <- data.frame(
  Distance = c(0, 2, 4, 6, 8, 10),
  Success  = c(2, 12, 3, 2, 1, 0),
  Total    = c(2, 12, 15, 8, 4, 1)
)
I built a model with the drc package (version 2.3-96) under R 3.1.2 on Windows Vista:
library(drc)
model <- drm(Success/Total ~ Distance, weights = Total,
             data = data, fct = LL.2(), type = "binomial")
summary(model)
plot(model, bp = .5, legend = FALSE,
     xlab = "Distance", ylab = "Probability of success", lwd = 2,
     cex = 1.2, cex.axis = 1.2, cex.lab = 1.2, log = "")
All seems to be OK, but when it comes to estimating ED95 (effective dose 95: the distance required to have a 0.95 probability of success), I think that ED95 has been swapped with ED5 (effective dose 5: the distance required to have a 0.05 probability of success):
ED(model, 95, interval="delta")
ED(model, 5, interval="delta")
ED95: 8.0780, SE: 2.0723, 95% CI (4.0165, 12.139)
ED5:  1.58440, SE: 0.46413, 95% CI (0.67472, 2.4941)
ED values in the drc package are by default calculated relative to the control level. In our case, we are looking for ED values calculated relative to the upper limit.
So we must change the reference value from "control" (the default) to "upper":
ED(model, 95, interval="delta", reference = "upper")
Many thanks to Christian Ritz
I am new to both R and statistics. I am playing with maximum likelihood estimation, and I am getting some incorrect results. I want to model x with a simple linear function:
library(stats4)
x <- apply(matrix(seq(1, 10, 1), nrow = 1), 1, function(x) 10 * x + runif(10, -3, 3))
LL <- function(a, b) {
  R <- apply(x, 1, function(y) a * y + b)
  -sum(log(R))
}
mle(LL, start = list(a = 10, b = 0))
I am getting the following result:
Coefficients:
a b
43571.957 1338.345
instead of a~10, b~0.
I modified the code according to the suggestions of Spacedman:
set.seed(99)
x <- apply(matrix(seq(1, 10, 1), nrow = 1), 1, function(x) 10 * x + runif(10, -3, 3))
LL <- function(a, b) {
  R <- x[, 1] - a * (1:10) + b
  -sum(R^2)
}
library(stats4)
mle(LL, start=list(a=11, b=0.3))
Error in solve.default(oout$hessian) :
Lapack routine dgesv: system is exactly singular: U[1,1] = 0
I do not know how to get rid of this error. Changing the seed and generating the x values again does not help.
There are a couple of things to notice here. To clarify, we start by changing the distribution of the error term from a uniform distribution, runif(x, -3, 3), to the standard normal distribution, rnorm(x). We can now easily simulate your data, set up the (minus) log-likelihood, and maximize it (by minimizing the negative log-likelihood):
a <- 10
b <- 0
set.seed(99)
x <- apply(matrix(seq(1, 10, 1), nrow=1), 1, function(x) b + a * x + rnorm(10))
minuslogL <- function(a, b) -sum(dnorm(x[, 1] - (b + a * 1:10), log = TRUE))
library(stats4)
mle(minuslogL, start = list(a = 11, b = 0.3))
Call:
mle(minuslogl = minuslogL, start = list(a = 11, b = 0.3))
Coefficients:
a b
9.8732793 0.5922192
Notice that this works well, since the likelihood is smooth and mle() uses "BFGS" for the optimization, i.e. a quasi-Newton, gradient-based approach. Let's try the same with uniform errors:
set.seed(99)
x <- apply(matrix(seq(1, 10, 1), nrow=1), 1, function(x) b + a * x + runif(10, -3, 3))
minuslogL2 <- function(a,b) -sum(dunif(x[, 1] -(a * 1:10 + b), -3, 3, log = TRUE))
mle(minuslogL2, start = list(a = 11, b = 0.3))
Error in optim(start, f, method = method, hessian = TRUE, ...) :
initial value in 'vmmin' is not finite
This fails! Why? Because the uniform errors restrict the parameter space, you will not get a smooth likelihood. If you move the parameters a, b too far away from the true values, you get Inf; if you move close enough, you get the same likelihood value (i.e. there are many possible minima):
> minuslogL2(11, 0.3)
[1] Inf
> minuslogL2(10, 0)
[1] 17.91759
> minuslogL2(10.02, 0.06)
[1] 17.91759
Maximizing this likelihood amounts to finding the set {(a, b) : -logL(a, b) = -logL(10, 0)}, which can be found by a plain search algorithm, as sketched below.
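For example, a brute-force grid search over (a, b) (a sketch using minuslogL2 and x as defined above) makes the flat region visible:
# every (a, b) pair with a finite value of minuslogL2 attains the same minimum,
# so the "solution" is a whole region rather than a single point
grid <- expand.grid(a = seq(9.5, 10.5, by = 0.01), b = seq(-1, 1, by = 0.05))
grid$nll <- mapply(minuslogL2, grid$a, grid$b)
feasible <- subset(grid, is.finite(nll))
range(feasible$nll)   # a single value: the flat minimum
nrow(feasible)        # many (a, b) pairs achieve it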