Calculating 95% Confidence Interval for Several Columns at Once in R - r

I have survey data (picture sample below) I'm working with to find 95% confidence intervals for. The Q#d columns (Q1d, Q2d, etc.) each correspond to different questions on the survey (Likert scale with results dichotomized, 1 = yes, 0 = no). The intervention column describes whether the results were before intervention (FALSE) or after intervention (TRUE). What I want to do is get the 95% confidence intervals on the difference in proportions for each question before and after intervention.
For example, let's say for Q1d the proportion that answered "yes" before intervention is .2 and after the intervention is .5. The difference would be .3 or 30%, and I want to calculate the confidence interval (let's say between 25% and 35%) on the difference. I want to do this for every single question in the survey (all Q1d). I have not been able to find a way to iterate through and do this for all questions (columns). I've written a function that can successfully do it for one column, but iterating through column names isn't working for me, and I don't know how to store the results as a vector/dataframe. I've included the function below. Any guidance?
Thanks so much!!
get_conf_int <- function(df, colName) {
myenc <- enquo(colName)
p <- df %>%
group_by(Intervention) %>%
summarize(success=sum(UQ(myenc)==1, na.rm=TRUE), total=n())
prop.test(x=pull(p,success), n=pull(p, total))$conf.int[2:1]*-100
}
And I can call the function like:
get_conf_int(db, Q1d)
I'm using prop.test to find confidence interval for now, but open to other methods as well.

I can't assure you if prop.table is better than binom.test, you should read more about those two.
library(dplyr)
# just for this example, you have your survey here
df <- data.frame(Intervention=sample(x = c(TRUE,FALSE), size = 20, replace = TRUE),
Q1d=sample(x = 0:1, size = 20, replace = TRUE),
Q2d=sample(x = 0:1, size = 20, replace = TRUE),
Q3d=sample(x = 0:1, size = 20, replace = TRUE),
Q4d=sample(x = 0:1, size = 20, replace = TRUE),
Q5d=sample(x = 0:1, size = 20, replace = TRUE),
Q6d=sample(x = 0:1, size = 20, replace = TRUE),
Q7d=sample(x = 0:1, size = 20, replace = TRUE))
# vector with the sum of FALSE and the sum of TRUE
count_Intervention <- c(length(which(!df$Intervention)),length(which(df$Intervention)))
# group by TRUE/FALSE and sum(count) the 1's
df_sum <- df %>%
group_by(Intervention) %>%
summarize(across((colnames(df)[-1]),list(sum)))
# for new info. I added the pvalue, that might be important
new_df <- data.frame(Question=as.character(), LowerConfInt=as.numeric(), UpperConfInt=as.numeric(), Pvalue = as.numeric())
#loop
for (Q_d in colnames(df_sum)[-1]) {
lower <- prop.test(as.vector(t(df_sum[,Q_d])), count_Intervention)$conf.int[1]
upper <- prop.test(as.vector(t(df_sum[,Q_d])), count_Intervention)$conf.int[2]
pvalue <- prop.test(as.vector(t(df_sum[,Q_d])), count_Intervention)$p.value
new_df <- rbind(new_df, data.frame(Q_d, lower, upper, pvalue))
}
new_df
Q_d lower upper pvalue
1 Q1d_1 -0.2067593 0.8661000 0.34844258
2 Q2d_1 -0.9193444 -0.1575787 0.05528499
3 Q3d_1 -0.4558861 0.5218202 1.00000000
4 Q4d_1 -0.4558861 0.5218202 1.00000000
5 Q5d_1 -0.7487377 0.3751114 0.74153726
6 Q6d_1 -0.2067593 0.8661000 0.34844258
7 Q7d_1 -0.4558861 0.5218202 1.00000000

Related

R: rowwise Confidence Intervals for a Difference of Binomials

I have a table with frequencies for control and treatment group for a multinomial factor (`response'), with three levels (Negative, Neutral, Positive). I want to calculate for each levelthe difference between treatment and control, and confidence intervals, and add them to the table.
I am looking for something that can be applied to several similar frequency tables that compare treatment and control groups, where the response categories vary (e.g. unlikely, 50-50, likely).
Here is the table:
N_A <- data.frame (response = c("Negative", "Neutral", "Positive"),
n_T = c(48, 43, 42), # treatment group
n_C = c(36, 40, 51) # control group
)
I have tried to use the BinomDiffCI function from the DescTools package. I managed to write a function that runs BinomDiffCI for the first row, and extracts the lower CI.
library(DescTools)
lci.diff <- function(){
xci <- BinomDiffCI(x1 = N_A[1,2], n1 = sum(N_A[2]), x2 = N_A[1,3], n2 = sum(N_A[3]), method=c("waldcc"))
xci[,2]
}
It's not great, but maybe a start. I want to 1) add difference and upper CI, 2) do the same for all rows, 3) attach this to the dataset, and 4) apply the same to other frequency tables comparing treatment and control.
Here is the code to create the lower and upper bounds of the confidence interval
library(DescTools)
ci_diff <- function(df, i) {
tbl <- BinomDiffCI(x1 = df[i,2], n1 = sum(df[2]), x2 = df[i,3], n2 = sum(df[3]), method=c("waldcc"))
tbl[ , c("lwr.ci", "upr.ci")]
}
N_A <- cbind(N_A, t(sapply(1:nrow(N_A), \(i) ci_diff(N_A, i)))
response n_T n_C lwr.ci upr.ci
1 Negative 48 36 -0.04342071 0.1982961
2 Neutral 43 40 -0.11268594 0.1293812
3 Positive 42 51 -0.20971246 0.0381418

Sampling Random Points Closer to Today?

I have this dataset in R:
date = sample(seq(as.Date('2015-01-01'), as.Date('2022-08-12'), by = "day"), 1000)
var1 = rnorm(1000, 1000,1000)
var2 = rnorm(1000, 1000,1000)
var3 = rnorm(1000, 1000,1000)
question_data = data.frame(date, var1, var2, var3)
question_data$id = 1:nrow(question_data)
I want to take 1000 random samples from this data such that "there are more points closer to today's date compared to the starting date".
I thought of a very simple way to do this - first, I order this dataset by date:
question_data <- question_data[order(-question_data$date),]
Then, I create a new "date_id":
question_data$date_id = 1:nrow(question_data)
From here, I choose an arbitrary cut-off and arbitrarily take weighted samples:
part_1 <- question_data[which(question_data$date_id > 750), ]
part_2 <- question_data[which(question_data$date_id < 750), ]
library(dplyr)
random_sample = data.frame(sample_n(part_1, 250, replace = TRUE), sample_n(part_2, 500, replace = TRUE))
Is there a better way to do this? Perhaps some methods that might be able to perform "smooth" random samples?
Thank you!
We can see the distribution of dates in the original data set if we do:
hist(lubridate::year(question_data$date), breaks = 2014:2022 + 0.5)
If we want to sample the dates more frequently as they get closer to the current time, we can first arrange the data frame in date order:
question_data <- question_data[order(question_data$date),]
Now, we can sample from all rows of the data frame, but we can specify the row number itself as a weighting, such that the probability of a particular row being selected goes from essentially 0 for row 1 to about 1 in 500 for the final row. Let's take a sample of 100 using this method and look at the histogram of dates:
n <- 100
samp <- question_data[sample(seq(nrow(question_data)), n, replace = FALSE,
prob = seq(nrow(question_data))),]
hist(lubridate::year(samp$date), breaks = 2014:2022 + 0.5)

r: for loop to simulate predictions when random sampling is applied

I am trying to simulate how replacement/reassignment of values on random samples affect predictions conveyed by AUC.
I have a tumor classification in a dataframe denoted df$who which has levels 1, 2, 3 corresponding to the severity of the tumor lesion.
Intro to the question
Lets say the baseline data looks like this:
set.seed(1)
df <- data.frame(
who = as.factor(sample(1:3, size = 6000, replace = TRUE, prob = c(0.8, 0.15, 0.05))),
age = round(runif(n = 6000, min = 18, max = 95), digits = 1),
gender = sample(c("m", "f"), size = 6000, replace = TRUE, prob = c(1/3, 2/3)),
event.time = runif(n = 6000, min = 8, max = 120),
event = as.factor(sample(0:2, size = 6000, replace = TRUE, prob = c(0.25, 0.2, 0.55)))
)
And a standard cause-specific Cox regression looks like:
library(survival)
a_baseline <- coxph(Surv(event.time, event == 1) ~ who + age + gender, data = df, x = TRUE)
From which AUC can be obtained as a measure of predictive performance. Here, leave-one-out bootstrap on 5-year prediction on df$event == 1.
library(riskRegression)
u <- Score(list("baseline" = a_baseline),
Surv(event.time, event == 1) ~ 1,
data = df,
times = 60,
plots = "cal",
B = 50,
split.method = "loob",
metrics = c("auc", "brier")
)
# The AUC is then obtained
u$AUC$score$AUC[2]
Question
I want to simulate how re-classifying a random 5% of df$who == 1 to dfwho == 2 affect the 5-year prediction on df$event == 1
I want to create 10 separate and simulated subsets of the baseline data df, but each containing a random allocation of 5% df$who == 1 to .. == 2. Then, I want to apply each of these 10 separate and simulated subsets to predict the 5-year risk of df$event == 1.
I have applied a for loop to this. The expected output is dataframe that tells me which of the 10 simulated datasets yielded the highest and lowest u$AUC$score$AUC[2] (i.e., the best and worst prediction).
I am new to for loop, but here is my go (that obviously did not work).
all_auc <- data.frame() ## create a dataframe to fill in AUC from all 10 simulated sub-datasets
for(i in 1:10){ #1:10 represent the simulated datasets from 1 to 10
df[i] <- df #allocating baseline data to each of the 10 datasets
df[i]$who[sample(which(df[i]$who==1), round(0.05*length(which(df[i]$who==1))))]=2 #create the random 5% allocation of who==1 to who==2 in the i'th simulated dataset
ith_cox <- coxph(Surv(event.time, event == 1) ~ who + age + gender, data = df[i], x = TRUE) #create the i'th Cox regression based on the i´th dataset
# create the predictions based on the i´th Cox
u[i] <- Score(list("baseline" = ith_cox),
Surv(event.time, event == 1) ~ 1,
data = df[i],
times = 60,
plots = "cal",
B = 50,
split.method = "loob",
metrics = c("auc", "brier")
)
# summarize all AUC from all 10 sub-datasets
all_auc <- u[i]$AUC$score$AUC[2]
}
(1) I could not get this for loop to work as described, and
(2) the final dataframe all_auc should provide only which of the 10 datasets yielded the worst and best predictions (I will then use these two data sets for further analysis).
A final note
This is only a reproducible example. The for loop will be applied to 10.000 simulated datasets in our analysis. I do not know if this could affect the answer - but, it illustrates the importance of the result: a dataframe (or vector?) that simply tells me which simulated dataset yielded the best vs worst predictions, and that I subsequently will be able to use these two dataframes for furter analysis, eg df2930 and df8939.

How to calculate the average of multiple standard deviations in R?

I am trying to figure out how to calculate the standard deviation of a dataset when I have a couple of standard deviations. Let's just look at this MWE:
set.seed(1234)
dummy_data <- data.frame(
"col_1" = sample(1:7, size = 10, replace = TRUE),
"col_2" = sample(1:7, size = 10, replace = TRUE),
"col_3" = sample(1:7, size = 10, replace = TRUE),
"col_4" = sample(1:7, size = 10, replace = TRUE)
)
Now since I know all the data points I can calculate the total standard deviation as follows:
> sd(as.matrix(dummy_data))
[1] 1.727604
But the real data that I have at hand is the following:
> dplyr::summarise_all(dummy_data, sd)
col_1 col_2 col_3 col_4
1 1.837873 1.873796 1.37032 1.888562
If I follow the usual method of calculating the average of multiple standard deviations with similar sample sizes, I would apply the following:
sds <- dplyr::summarise_all(dummy_data, sd)
vars <- sds^2
mean_sd <- sqrt(sum(vars) / (length(vars) - 1))
> mean_sd
[1] 2.027588
which is not the same! Now I have tried without the minus one:
> sqrt(sum(vars) / (length(vars)))
[1] 1.755942
which does not solve the problem. I have tried defining an own standard deviation function like this:
own_sd <- function(x) {
sqrt(sum((x - mean(x))^2) / length(x))
}
to eliminate the x - 1 in the dplyr::summarise_all() step, and then average according to the step above:
> sqrt(sum(dplyr::summarise_all(dummy_data, own_sd)^2) / 3)
[1] 1.923538
> sqrt(sum(dplyr::summarise_all(dummy_data, own_sd)^2) / 4)
[1] 1.665833
But all seem to give a different result than the sd(as.matrix()) method. What is going wrong here?
You can't calculate a global SD from only knowing group SDs. For example:
x1 = 1:5
x2 = 11:15
x3 = 101:105
## all the SDs are equal
(sd1 = sd(x1))
#[1] 1.581139
(sd2 = sd(x2))
#[1] 1.581139
(sd3 = sd(x3))
#[1] 1.581139
## however, combining the groups in pairs give very different results
sd(c(x1, x2))
# [1] 5.477226
sd(c(x1, x3))
# [1] 52.72571
This demonstrates that even if the sample sizes are identical, knowing the standard deviation of two groups does not help you calculate the standard deviation of those groups combined.
As per Merijn van Tilborg's comment, if you know the group sizes and the group means, the calculation is possible as shown here.

How to Perform Statistical Two-Sided Test for Independence (on Proportion) in R?

I am trying to compare two percentages/proportions for statistical significance in R, using a Chi-Square test. I am familiar with a SAS method for Chi Square in which I supply a dataset column for a numerator, another column for denominator, and a categorical variable to distinguish distributions (A/B).
However I am getting unexpected values in R using some examples sets. When I test two similar populations, with low sample sizes, I am getting p-values of (approximately) zero, where I would expect the p-values to be very high (~ 1).
My test set is below, where I went with sugar content in a batch of water: e.g. "does group A use the same ratio of sugar as group B?". My actual problem is similar, where this isn't a pass-fail type test and the numerator and denominator values can vary wildly between samples (different sugar and/or water weights per sample). My first objective is to verify that I can get a high p-value from two similar sets. The next question is, at what sample size does the p-value become low enough to indicate significance?
# CREATE 2 NEARLY-EQUAL DISTRIBUTIONS (EXPECTING HIGH P-VALUE FROM PROP.TEST)
set.seed(108)
group_A = tibble(group = "A", sugar_lbs = rnorm(mean = 10, sd = 3, n = 50), batch_lbs = rnorm(mean = 30, sd = 6, n = 50))
group_B = tibble(group = "B", sugar_lbs = rnorm(mean = 10, sd = 3, n = 50), batch_lbs = rnorm(mean = 30, sd = 6, n = 50))
batches <- rbind(group_A, group_B)
I then do a summarize to calculate the overall sugar percentage tendency between groups:
# SUMMARY TOTALS
totals <- batches %>%
group_by(group) %>%
summarize(batch_count = n(),
batch_lbs_sum = sum(batch_lbs),
sugar_lbs_sum = sum(sugar_lbs),
sugar_percent_overall = sugar_lbs_sum / batch_lbs_sum) %>%
glimpse()
I then supply the sugar percentage between groups to a prop.test, expecting a high p-value
# ADD P-VALUE & CONFIDENCE INTERVAL
stats <- totals %>%
rowwise() %>%
summarize(p_val = prop.test(x = sugar_percent_overall, n = batch_count, conf.level = 0.95, alternative = "two.sided")$p.value) %>%
mutate(p_val = round(p_val, digits = 3)) %>%
mutate(conf_level = 1 - p_val) %>%
select(p_val, conf_level) %>%
glimpse()
# FINAL SUMMARY TABLE
cbind(totals, stats) %>%
glimpse()
Unforunately the final table gives me a p-value of 0, suggesting the two nearly-identical sets are independent/different. Shouldn't I get a p-value of ~1?
Observations: 2
Variables: 7
$ group <chr> "A", "B"
$ batch_count <int> 50, 50
$ batch_lbs_sum <dbl> 1475.579, 1475.547
$ sugar_lbs_sum <dbl> 495.4983, 484.6928
$ sugar_percent_overall <dbl> 0.3357992, 0.3284833
$ p_val <dbl> 0, 0
$ conf_level <dbl> 1, 1
From another angle, I also tried to compare the recommended sample size from power.prop.test with an actual prop.test using this recommended sample size. This gave me the reverse problem -- I was a expecting low p-value, since I am using the recommended sample size, but instead get a p-value of ~1.
# COMPARE PROP.TEST NEEDED COUNTS WITH AN ACTUAL PROP.TEXT
power.prop.test(p1 = 0.33, p2 = 0.34, sig.level = 0.10, power = 0.80, alternative = "two.sided") ## n = 38154
prop.test(x = c(0.33, 0.34), n = c(38154, 38154), conf.level = 0.90, alternative = "two.sided") ## p = 1 -- shouldn't p be < 0.10?
Am I using prop.test wrong or am I misinterpreting something? Ideally, I would prefer to skip the summarize step and simply supply the dataframe, the numerator column 'sugar_lbs', and the denominator 'batch_lbs' as I do in SAS -- is this possible in R?
(Apologies for any formatting issues as I'm new to posting)
---------------------------------
EDIT - EXAMPLE WITH ONLY PROPORTIONS & SAMPLE SIZE
I think my choice of using normal distributions may have distracted from the original question. I found an example that gets to the heart of what I was trying to ask, which is how to use prop test given only a proportion/percentage and the sample size. Instead of city_percent and city_total below, I could simply rename these to sugar_percent and batch_lbs. I think this reference answers my question, where prop.test appears to be the correct test to use.
My actual problem has an extremely non-normal distribution, but is not easily replicated via code.
STANFORD EXAMPLE (pages 37-50)
- https://web.stanford.edu/class/psych10/schedule/P10_W7L1
df <- tibble(city = c("Atlanta", "Chicago", "NY", "SF"), washed = c(1175, 1329, 1169, 1521), not_washed = c(413, 180, 334, 215)) %>%
mutate(city_total = washed + not_washed,
city_percent = washed / city_total) %>%
select(-washed, -not_washed) %>%
glimpse()
# STANFORD CALCULATION (p = 7.712265e-35)
pchisq(161.74, df = 3, lower.tail = FALSE)
# PROP TEST VERSION (SAME RESULT, p = 7.712265e-35)
prop.test(x = df$city_percent * df$city_total, n = df$city_total, alternative = "two.sided", conf.level = 0.95)$p.value
The documentation for prop.test says:
Usage prop.test(x, n, p = NULL,
alternative = c("two.sided", "less", "greater"),
conf.level = 0.95, correct = TRUE)
Arguments
x a vector of counts of successes, a one-dimensional table with two entries, or a
two-dimensional table (or matrix) with 2 columns, giving the counts of
successes and failures, respectively.
n a vector of counts of trials; ignored if x is a matrix or a table.
So if you want a "correct" test, you would have to use sugar_lbs_sum as the x instead of sugar_percent_overall. You should still receive some kind of warning that the x is non-integral, but that's not my major concern.
But from a statistical perspective this is the complete wrong way of doing things. You are directly causing spurious correlation for a testing of difference between two quantities by dividing by their sum arbitrarily. If the samples (sugar_lbs_sum) are independent, but you divide by their sums, you have made the ratios dependent. This violates the assumptions of the statistical test in a critical way. Kronmal 1993 "Spurious correlation and the fallacy of the ratio" covers this.
The data you generated are independent normal, so don't sum them, rather test for a difference with the t-test.
The Stanford link I added to my original post answered my question. I modified the Stanford example to simply rename the variables from city to group, and washed counts to sugar_lbs. I also doubled one batch, (or comparing a small versus large city). I now get the expected high p-value (0.65) indicating that there is no statistical significance that the proportions are different.
When I add more groups (for more degrees of freedom) and continue to vary batch sizes proportionally, I continue to get high p-values as expected, confirming the recipe is the same. If I modify the sugar percent of any one group, the p-value immediately drops to zero indicating one of the groups is different, as expected.
Finally, when doing the prop.text within a 'dplyr' pipe, I found I should not have used the rowwise() step, which causes my p-values to fall to zero. Removing this step gives the correct p-value. The only downside is that I don't yet know which group is different until I compare only 2 groups at a time iteratively.
#---------------------------------------------------------
# STANFORD EXAMPLE - MODIFIED TO SUGAR & ONE DOUBLE BATCHED
#--------------------------------------------------------
df <- tibble(group = c("A", "B"), sugar_lbs = c(495.5, 484.7), water_lbs = c(1475.6 - 495.5, 1475.6 - 484.7)) %>%
mutate(sugar_lbs = ifelse(group == "B", sugar_lbs * 2, sugar_lbs),
water_lbs = ifelse(group == "B", water_lbs * 2, water_lbs)) %>%
mutate(batch_lbs = sugar_lbs + water_lbs,
sugar_percent = sugar_lbs / batch_lbs) %>%
glimpse()
sugar_ratio_all <- sum(df$sugar_lbs) / (sum(df$sugar_lbs) + sum(df$water_lbs))
water_ratio_all <- sum(df$water_lbs) / (sum(df$sugar_lbs) + sum(df$water_lbs))
dof <- (2 - 1) * (length(df$group) - 1)
df <- df %>%
mutate(sugar_expected = (sugar_lbs + water_lbs) * sugar_ratio_all,
water_expected = (sugar_lbs + water_lbs) * water_ratio_all) %>%
mutate(sugar_chi_sq = (sugar_lbs - sugar_expected)^2 / sugar_expected,
water_chi_sq = (water_lbs - water_expected)^2 / water_expected) %>%
glimpse()
q <- sum(df$sugar_chi_sq) + sum(df$water_chi_sq)
# STANFORD CALCULATION
pchisq(q, df = dof, lower.tail = F)
# PROP TEST VERSION (SAME RESULT)
prop.test(x = df$sugar_percent * df$batch_lbs, n = df$batch_lbs, alternative = "two.sided", conf.level = 0.95)$p.value

Resources