Stratified cluster sampling estimates from survey package - r

I want to estimate means and totals from a stratified sampling design in which single stage cluster sampling was used in each stratum. I believe I have the design properly specified using the svydesign() function of the survey package. But I'm not sure how to correctly specify the stratum weights.
Example code is shown below. I provide unadjusted stratum weights using the weights= argument. I expected that the estimate and the SE from svytotal() would be equal to the sum of the stratum weights (70, in the example) times the estimate and SE from svymean(). Instead the estimates differ by a factor of 530 (which is the sum of the stratum weights over all of the elements in the counts data) and the SEs differ by a factor of 898 (???). My questions are (1) how can I provide my 3 stratum weights to svydesign() in a way that it understands, and (2) why aren't the estimates and SEs from svytotal() and svymean() differing by the same factor?
library(survey)
# example data from a stratified sampling design in which
# single stage cluster sampling is used in each stratum
counts <- data.frame(
Stratum=rep(c("A", "B", "C"), c(5, 8, 8)),
Cluster=rep(1:8, c(3, 2, 3, 2, 3, 2, 3, 3)),
Element=c(1, 2, 3, 1, 2, 1, 2, 3, 1, 2, 1, 2, 3, 1, 2, 1, 2, 3, 1, 2, 3),
Count = 1:21
)
# stratum weights
weights <- data.frame(
Stratum=c("A", "B", "C"),
W=c(10, 20, 40)
)
# combine counts and weights
both <- merge(counts, weights)
# estimate mean and total count
D <- svydesign(id=~Cluster, strata=~Stratum, weights=~W, data=both)
a <- svymean(~Count, D)
b <- svytotal(~Count, D)
sum(weights$W) # 70
sum(both$W) # 530
coef(b)/coef(a) # 530
SE(b)/SE(a) # 898.4308
First update
I'm adding a diagram to help explain my design. The entire population is a lake with known area (70 ha in this example). The strata have known areas, too (10, 20, and 40 ha). The number of clusters allocated to each stratum was not proportional. Also, the clusters are tiny relative to the number that could possibly be sampled, so the finite population correction is FPC = 1.
I want to calculate an overall mean and SE on a per unit area basis and a total that is equal to 70 times this mean and SE.
Second update
I wrote the code to do the calculations from scratch. I get a total estimate of 920 with se 61.6.
library(survey)
library(tidyverse)
# example data from a stratified sampling design in which
# single stage cluster sampling is used in each stratum
counts <- data.frame(
Stratum=rep(c("A", "B", "C"), c(5, 8, 8)),
Cluster=rep(1:8, c(3, 2, 3, 2, 3, 2, 3, 3)),
Element=c(1, 2, 3, 1, 2, 1, 2, 3, 1, 2, 1, 2, 3, 1, 2, 1, 2, 3, 1, 2, 3),
Count = c(5:1, 6:21)
)
# stratum weights
areas <- data.frame(
Stratum=c("A", "B", "C"),
A_h=c(10, 20, 40)
)
# calculate cluster means
step1 <- counts %>%
group_by(Stratum, Cluster) %>%
summarise(P_hi = sum(Count), m_hi=n())
step2 <- step1 %>%
group_by(Stratum) %>%
summarise(
ybar_h = sum(P_hi) / sum(m_hi),
n_h = n(),
sh.numerator = sum((P_hi - ybar_h*m_hi)^2),
mbar_h = mean(m_hi)
) %>%
mutate(
S_ybar_h = 1 / mbar_h * sqrt( sh.numerator / (n_h * (n_h-1)) )
)
# now expand up to strata
step3 <- step2 %>%
left_join(areas) %>%
mutate(
W_h = A_h / sum(A_h)
) %>%
summarise(
A = sum(A_h),
ybar_strat = sum(W_h * ybar_h),
S_ybar_strat = sum(W_h * S_ybar_h / sqrt(n_h))
) %>%
mutate(
tot = A * ybar_strat,
S_tot = A * S_ybar_strat
)
step2
step3
This gives the following output:
> step2
# A tibble: 3 x 6
Stratum ybar_h n_h sh.numerator mbar_h S_ybar_h
<fctr> <dbl> <int> <dbl> <dbl> <dbl>
1 A 3.0 2 18.0 2.500000 1.200000
2 B 9.5 3 112.5 2.666667 1.623798
3 C 17.5 3 94.5 2.666667 1.488235
> step3
# A tibble: 1 x 5
A ybar_strat S_ybar_strat tot S_tot
<dbl> <dbl> <dbl> <dbl> <dbl>
1 70 13.14286 0.8800657 920 61.6046

(Revised answer to revised question)
In this case svytotal isn't what you want -- it's for the actual population total of the elements being sampled, and so doesn't make sense when the population is thought of as infinitely bigger than the sample. The whole survey package is really designed for discrete, finite populations, but we can work around it.
I think you want to get a mean for each stratum and then multiply it by the stratum weights. To do that,
D <- svydesign(id=~Cluster, strata=~Stratum, data=both)
means<- svyby(~Count, ~Stratum, svymean, design=D)
svycontrast(means, quote(10*A+20*B+40*C))
You'll get a warning
Warning message:
In vcov.svyby(stat) : Only diagonal elements of vcov() available
That's because svyby doesn't return covariances between the stratum means. It's harmless, because the strata really are independent samples (that's what stratification means) so the covariances are zero.

svytotal is doing what I think it should do here: weights are based on sampling probability, so they are only defined for sampling units. The svydesign call applied those weights to the clusters and (because cluster sampling) to the elements, giving the 530-fold higher total. You need to supply either observation weights or enough information for svydesign to calculate them itself. If this is cluster sampling with no subsampling, you can divide the stratum weight over the clusters to get the cluster weight and the divide this over elements within a cluster to get the observation weight. Or, if the stratum weight is the number of clusters in the population, you can use the fpc argument to svydesign
The fact that the SE doesn't scale the same way as the point estimate is because the population size is unknown and has to be estimated. The mean is the estimated total divided by the estimated population size, and the SE estimate takes account of the variance of the denominator and its covariance with the numerator.

Related

Bootstrap in pair within a matched sample

Hi I would like to compare outcomes between the treatment and control groups by bootstrapping the matched sample. This matched sample was obtained by using the genetic matching approach provided by MatchIt. According to this paper
Austin, P. C., & Small, D. S. (2014). The use of bootstrapping when using propensity‐score matching without replacement: a simulation study. Statistics in medicine, 33(24), 4306-4319.
I think I have to bootstrap the matched sample based on each matched pair, not each individual. However, I don't know how to re-sample by each matched pair.
Here I provide an example:
id <- c("A", "B", "C", "D", "E", "F")
treatment <- c(1, 0, 1, 0, 1, 0)
subclass <- c(1, 1, 2, 2, 3, 3)
outcome1 <- c(100, 300, 400, 500, 600, 700)
outcome2 <- c(200, 50, 600, 800, 900, 1000)
matched_sample <- data.frame(id, treatment, subclass, outcome1, outcome2)
> matched_sample
id treatment subclass outcome1 outcome2
1 A 1 1 100 200
2 B 0 1 300 50
3 C 1 2 400 600
4 D 0 2 500 800
5 E 1 3 600 900
6 F 0 3 700 1000
Subclass indicates the matched pair. For example, individual A and B are a matched pair because they share the same subclass number. Whenever A appears in any sample, B should also appear in that sample.
After bootstrapping, I will run regression on outcome1 and outcome2 to estimate the average treatment effects (ATE), and also to obtain the 95% confidence intervals of the ATEs.
I think the package boot might be useful, but I'm not sure how to use it. I would be really grateful for your help on this.
EDIT: The ATEs that I would like to estimate are basically the coefficients of "treatment" in regressions. That is,
lm.ATE1 <- lm(outcome1 ~ treatment)
lm.ATE2 <- lm(outcome2 ~ treatment)
The idea is to bootstrap the matched sample for 10,000 times, estimate these two regressions within each bootstrapped samples, rank the resulting coefficients, and then find coefficients at the 2.5 and 97.5 percentile as the 95% confidence intervals for the ATE on outcome1 and outcome2 respectively.
Hopefully this clarifies. Thanks in advance.
The following function resamples from matched_sample R times, keeping matched pairs. Then it computes two regressions and extracts the coefficients the question names ATE*, returning a matrix 2xR. Finally, it uses apply to get the percentile 95% confidence intervals.
id <- c("A", "B", "C", "D", "E", "F")
treatment <- c(1, 0, 1, 0, 1, 0)
subclass <- c(1, 1, 2, 2, 3, 3)
outcome1 <- c(100, 300, 400, 500, 600, 700)
outcome2 <- c(200, 50, 600, 800, 900, 1000)
matched_sample <- data.frame(id, treatment, subclass, outcome1, outcome2)
fun_boot <- function(data, R = 10000L) {
f <- function() {
b <- sample(uniq_sclass, n, TRUE)
out <- sp[match(b, uniq_sclass)]
out <- do.call(rbind, out)
lm.ATE1 <- lm(outcome1 ~ treatment, out)
lm.ATE2 <- lm(outcome2 ~ treatment, out)
c(ATE1 = unname(coef(lm.ATE1))[2],
ATE2 = unname(coef(lm.ATE2))[2])
}
sp <- split(data, data$subclass)
n <- length(sp)
uniq_sclass <- names(sp)
replicate(R, f())
}
set.seed(2022)
# change this value to 10,000
R <- 10L
bootres <- fun_boot(matched_sample, R)
t(apply(bootres, 1, quantile, probs = c(0.025, 0.975)))
#> 2.5% 97.5%
#> ATE1 -159.1667 -100.00000
#> ATE2 -159.1667 47.91667
Created on 2022-08-12 by the reprex package (v2.0.1)

How to Perform Statistical Two-Sided Test for Independence (on Proportion) in R?

I am trying to compare two percentages/proportions for statistical significance in R, using a Chi-Square test. I am familiar with a SAS method for Chi Square in which I supply a dataset column for a numerator, another column for denominator, and a categorical variable to distinguish distributions (A/B).
However I am getting unexpected values in R using some examples sets. When I test two similar populations, with low sample sizes, I am getting p-values of (approximately) zero, where I would expect the p-values to be very high (~ 1).
My test set is below, where I went with sugar content in a batch of water: e.g. "does group A use the same ratio of sugar as group B?". My actual problem is similar, where this isn't a pass-fail type test and the numerator and denominator values can vary wildly between samples (different sugar and/or water weights per sample). My first objective is to verify that I can get a high p-value from two similar sets. The next question is, at what sample size does the p-value become low enough to indicate significance?
# CREATE 2 NEARLY-EQUAL DISTRIBUTIONS (EXPECTING HIGH P-VALUE FROM PROP.TEST)
set.seed(108)
group_A = tibble(group = "A", sugar_lbs = rnorm(mean = 10, sd = 3, n = 50), batch_lbs = rnorm(mean = 30, sd = 6, n = 50))
group_B = tibble(group = "B", sugar_lbs = rnorm(mean = 10, sd = 3, n = 50), batch_lbs = rnorm(mean = 30, sd = 6, n = 50))
batches <- rbind(group_A, group_B)
I then do a summarize to calculate the overall sugar percentage tendency between groups:
# SUMMARY TOTALS
totals <- batches %>%
group_by(group) %>%
summarize(batch_count = n(),
batch_lbs_sum = sum(batch_lbs),
sugar_lbs_sum = sum(sugar_lbs),
sugar_percent_overall = sugar_lbs_sum / batch_lbs_sum) %>%
glimpse()
I then supply the sugar percentage between groups to a prop.test, expecting a high p-value
# ADD P-VALUE & CONFIDENCE INTERVAL
stats <- totals %>%
rowwise() %>%
summarize(p_val = prop.test(x = sugar_percent_overall, n = batch_count, conf.level = 0.95, alternative = "two.sided")$p.value) %>%
mutate(p_val = round(p_val, digits = 3)) %>%
mutate(conf_level = 1 - p_val) %>%
select(p_val, conf_level) %>%
glimpse()
# FINAL SUMMARY TABLE
cbind(totals, stats) %>%
glimpse()
Unforunately the final table gives me a p-value of 0, suggesting the two nearly-identical sets are independent/different. Shouldn't I get a p-value of ~1?
Observations: 2
Variables: 7
$ group <chr> "A", "B"
$ batch_count <int> 50, 50
$ batch_lbs_sum <dbl> 1475.579, 1475.547
$ sugar_lbs_sum <dbl> 495.4983, 484.6928
$ sugar_percent_overall <dbl> 0.3357992, 0.3284833
$ p_val <dbl> 0, 0
$ conf_level <dbl> 1, 1
From another angle, I also tried to compare the recommended sample size from power.prop.test with an actual prop.test using this recommended sample size. This gave me the reverse problem -- I was a expecting low p-value, since I am using the recommended sample size, but instead get a p-value of ~1.
# COMPARE PROP.TEST NEEDED COUNTS WITH AN ACTUAL PROP.TEXT
power.prop.test(p1 = 0.33, p2 = 0.34, sig.level = 0.10, power = 0.80, alternative = "two.sided") ## n = 38154
prop.test(x = c(0.33, 0.34), n = c(38154, 38154), conf.level = 0.90, alternative = "two.sided") ## p = 1 -- shouldn't p be < 0.10?
Am I using prop.test wrong or am I misinterpreting something? Ideally, I would prefer to skip the summarize step and simply supply the dataframe, the numerator column 'sugar_lbs', and the denominator 'batch_lbs' as I do in SAS -- is this possible in R?
(Apologies for any formatting issues as I'm new to posting)
---------------------------------
EDIT - EXAMPLE WITH ONLY PROPORTIONS & SAMPLE SIZE
I think my choice of using normal distributions may have distracted from the original question. I found an example that gets to the heart of what I was trying to ask, which is how to use prop test given only a proportion/percentage and the sample size. Instead of city_percent and city_total below, I could simply rename these to sugar_percent and batch_lbs. I think this reference answers my question, where prop.test appears to be the correct test to use.
My actual problem has an extremely non-normal distribution, but is not easily replicated via code.
STANFORD EXAMPLE (pages 37-50)
- https://web.stanford.edu/class/psych10/schedule/P10_W7L1
df <- tibble(city = c("Atlanta", "Chicago", "NY", "SF"), washed = c(1175, 1329, 1169, 1521), not_washed = c(413, 180, 334, 215)) %>%
mutate(city_total = washed + not_washed,
city_percent = washed / city_total) %>%
select(-washed, -not_washed) %>%
glimpse()
# STANFORD CALCULATION (p = 7.712265e-35)
pchisq(161.74, df = 3, lower.tail = FALSE)
# PROP TEST VERSION (SAME RESULT, p = 7.712265e-35)
prop.test(x = df$city_percent * df$city_total, n = df$city_total, alternative = "two.sided", conf.level = 0.95)$p.value
The documentation for prop.test says:
Usage prop.test(x, n, p = NULL,
alternative = c("two.sided", "less", "greater"),
conf.level = 0.95, correct = TRUE)
Arguments
x a vector of counts of successes, a one-dimensional table with two entries, or a
two-dimensional table (or matrix) with 2 columns, giving the counts of
successes and failures, respectively.
n a vector of counts of trials; ignored if x is a matrix or a table.
So if you want a "correct" test, you would have to use sugar_lbs_sum as the x instead of sugar_percent_overall. You should still receive some kind of warning that the x is non-integral, but that's not my major concern.
But from a statistical perspective this is the complete wrong way of doing things. You are directly causing spurious correlation for a testing of difference between two quantities by dividing by their sum arbitrarily. If the samples (sugar_lbs_sum) are independent, but you divide by their sums, you have made the ratios dependent. This violates the assumptions of the statistical test in a critical way. Kronmal 1993 "Spurious correlation and the fallacy of the ratio" covers this.
The data you generated are independent normal, so don't sum them, rather test for a difference with the t-test.
The Stanford link I added to my original post answered my question. I modified the Stanford example to simply rename the variables from city to group, and washed counts to sugar_lbs. I also doubled one batch, (or comparing a small versus large city). I now get the expected high p-value (0.65) indicating that there is no statistical significance that the proportions are different.
When I add more groups (for more degrees of freedom) and continue to vary batch sizes proportionally, I continue to get high p-values as expected, confirming the recipe is the same. If I modify the sugar percent of any one group, the p-value immediately drops to zero indicating one of the groups is different, as expected.
Finally, when doing the prop.text within a 'dplyr' pipe, I found I should not have used the rowwise() step, which causes my p-values to fall to zero. Removing this step gives the correct p-value. The only downside is that I don't yet know which group is different until I compare only 2 groups at a time iteratively.
#---------------------------------------------------------
# STANFORD EXAMPLE - MODIFIED TO SUGAR & ONE DOUBLE BATCHED
#--------------------------------------------------------
df <- tibble(group = c("A", "B"), sugar_lbs = c(495.5, 484.7), water_lbs = c(1475.6 - 495.5, 1475.6 - 484.7)) %>%
mutate(sugar_lbs = ifelse(group == "B", sugar_lbs * 2, sugar_lbs),
water_lbs = ifelse(group == "B", water_lbs * 2, water_lbs)) %>%
mutate(batch_lbs = sugar_lbs + water_lbs,
sugar_percent = sugar_lbs / batch_lbs) %>%
glimpse()
sugar_ratio_all <- sum(df$sugar_lbs) / (sum(df$sugar_lbs) + sum(df$water_lbs))
water_ratio_all <- sum(df$water_lbs) / (sum(df$sugar_lbs) + sum(df$water_lbs))
dof <- (2 - 1) * (length(df$group) - 1)
df <- df %>%
mutate(sugar_expected = (sugar_lbs + water_lbs) * sugar_ratio_all,
water_expected = (sugar_lbs + water_lbs) * water_ratio_all) %>%
mutate(sugar_chi_sq = (sugar_lbs - sugar_expected)^2 / sugar_expected,
water_chi_sq = (water_lbs - water_expected)^2 / water_expected) %>%
glimpse()
q <- sum(df$sugar_chi_sq) + sum(df$water_chi_sq)
# STANFORD CALCULATION
pchisq(q, df = dof, lower.tail = F)
# PROP TEST VERSION (SAME RESULT)
prop.test(x = df$sugar_percent * df$batch_lbs, n = df$batch_lbs, alternative = "two.sided", conf.level = 0.95)$p.value

How to write a function that compares points in time series data using paired t-tests

In the data set created below, lets pretend that I randomly picked up 20 flat rocks. Each of these rocks were assigned a unique ID number. I measured the concentration of 7 substances (Copper,Iron,Carbon,Lead,Mg,CaCO, and Zinc) across the surface of the longest axis of each rock. Distance is recorded in mm, and therefore is a function of each rocks length. Note that not all Rocks are of the same length. Location is a grouping variable that describes where the Rock was picked up.
ID <- data.frame(ID=rep(c(12,122,242,329,595,130,145,245,654,878), each = 200))
ID2 <- data.frame(ID=rep(c(863,425,24,92,75,3,200,300,40,500), each = 300))
RockID<-data.frame(RockID = c(unlist(ID), unlist(ID2)))
Location <- rep(c("Alpha","Beta","Charlie","Delta","Echo"), each = 1000)
a <- rep(c(1:200),times = 10)
b <- rep(c(1:300), times = 10)
Time <- data.frame(Time = c(unlist(a), unlist(b)))
set.seed(1)
Copper <- rnorm(5000, mean = 0, sd = 5)
Iron <- rnorm(5000, mean = 0, sd = 10)
Carbon <- rnorm(5000, mean = 0, sd = 1)
Lead <- rnorm(5000, mean = 0, sd = 4)
Mg <- rnorm(5000, mean = 0, sd = 6)
CaCO <- rnorm(5000, mean = 0, sd = 2)
Zinc <- rnorm(5000, mean = 0, sd = 3)
data <-cbind(RockID, Location, Time,Copper,Iron,Carbon,Lead,Mg,CaCO,Zinc)
data$ID <- as.factor(data$RockID)
For each substance, I want to know if there is typically a significant difference between the following points in these rocks:
1. The first observation and the average of the first 5, average of the first 7, and average of the first 10 observations
2. The last observation and the average of the last 5 average of the last 7, and average of the last 10 observations
3. The first observation and the last observation
How would I write a function that will conduct a t-test on each of these points (or the average of n-points) for each variable, and paste the p-values in a data frame structured something like this:
Points Copper Iron Carbon Lead Mg CaCO Zinc
Point1.vs.First5 … … … … … … …
Point1.vs.First7
Point1.vs.First10
Pointn.vs.Last5
Pointn.vs.Last7
Pointn.vs.Last10
Point1.vs.Pointn

Draw ID's from a data frame based on conditions and probabilities for the conditions

I have the following data frame
ID <- c(1,2,3,4,5,6)
pop <- c(10,10,10,10,1000,1000)
df <- data.frame(pop,ID)
Now I would like to draw unique ID's from this data frame with the condition, that the probability, that I draw an ID with a population of larger or equal of 1000 being 0.1 and that the probability for drawing an ID with a population of lower than 1000 being 0.9.
This is the code, that I have come up with:
sample(c(df$ID[df$pop>=1000],df$ID[df$pop<1000]),3, prob=c(0.1,0.9))
However, I get the following error message:
Error in sample.int(length(x), size, replace, prob) :
incorrect number of probabilities
How can I rewrite this problem so it works without assigning a probability to every ID?
Maybe this one is your solution:
# 1. Data set
df <- data.frame(
id = c(1, 2, 3, 4, 5, 6),
population = c(10, 10, 10, 10, 1000, 1000))
# 2. Set 'probability' by 'population'
df <- df %>%
mutate(probability = ifelse((population >= 1000), 0.1, 0.9))
A simple way could be to create a probability vector for each ID and pass that to sample function. For IDs having pop >=1000 probability vector should have 0.1 and for IDs having pop <1000 vector should contain 0.9.
# Create a probability vector for each ID and pass it to sample function.
sample(df$ID, 3, prob = ifelse(df$pop>=1000, 0.1, 0.9))
#[1] 4 3 2
sample(df$ID, 3, prob = ifelse(df$pop>=1000, 0.1, 0.9))
#[1] 2 1 4
sample(df$ID, 3, prob = ifelse(df$pop>=1000, 0.1, 0.9))
#[1] 3 2 4
Data:
ID <- c(1,2,3,4,5,6)
pop <- c(10,10,10,10,1000,1000)
df <- data.frame(pop,ID)

Inverted ED95 and ED5 in drc package

My aim is to determine the distance between site of injection of a treatment and target of this treatment with a 0.95 probability of success.
The outcome variable was a binary variable (Success:1/failure:0)
I used Dixon up and down methodology with six distances tested : 0, 2, 4, 6, 8 and 10 mm.
Here are my data :
column 1 : distances used
column 2 : number of success
column 3 : total number of patients
data <- data.frame(1:6,1:6,1:6)
data[,1] <- c(0, 2, 4, 6, 8, 10)
data[,2] <- c(2, 12, 3, 2, 1, 0)
data[,3] <- c(2, 12, 15, 8, 4, 1)
names(data) <- c("Distance", "Success", "Total")
I built a model with DRC package 2.3-96 and R ver 3.1.2 on Windows Vista Os :
library(drc)
model <- drm(Success/Total~Distance, weights=Total,
data=data, fct=LL.2(), type="binomial")
summary(model)
plot(model, bp=.5, legend=FALSE
, xlab=paste("Distance"), ylab="Probability of success", lwd=2,
cex=1.2, cex.axis=1.2, cex.lab=1.2, log = "")
All seems to be Ok
but when it come to estimating ED 95 (Effective dose 95 : distance required to have 0.95 probability of success), i think that this ED95 was inverted with ED5 (Effective dose 5 : distance required to have 0.05 probability of success) :
ED(model, 95, interval="delta")
ED(model, 5, interval="delta")
ED95 : 8.0780 SE: 2.0723 CI 95 % (4.0165 ; 12.139)
ED5 : 1.58440 SE: 0.46413 CI 95 % (0.67472 ; 2.4941)
ED values in drc package are by default calculated relative to the the control level. In our case, we are looking for ED values calculated relative to the upper limit.
So we must change the reference value from "control" (default) to "upper" :
ED(model, 95, interval="delta", reference = "upper")
Many thanks to Christian Ritz

Resources