I have a table of frequencies for a control and a treatment group on a multinomial factor (`response`) with three levels (Negative, Neutral, Positive). For each level I want to calculate the difference between treatment and control, with confidence intervals, and add these to the table.
I am looking for something that can be applied to several similar frequency tables that compare treatment and control groups, where the response categories vary (e.g. unlikely, 50-50, likely).
Here is the table:
N_A <- data.frame(response = c("Negative", "Neutral", "Positive"),
                  n_T = c(48, 43, 42), # treatment group
                  n_C = c(36, 40, 51)  # control group
)
I have tried to use the BinomDiffCI function from the DescTools package. I managed to write a function that runs BinomDiffCI for the first row, and extracts the lower CI.
library(DescTools)
lci.diff <- function() {
  xci <- BinomDiffCI(x1 = N_A[1, 2], n1 = sum(N_A[2]),
                     x2 = N_A[1, 3], n2 = sum(N_A[3]), method = "waldcc")
  xci[, 2]
}
It's not great, but maybe a start. I want to 1) add difference and upper CI, 2) do the same for all rows, 3) attach this to the dataset, and 4) apply the same to other frequency tables comparing treatment and control.
Here is the code to create the lower and upper bounds of the confidence interval:
library(DescTools)
ci_diff <- function(df, i) {
  tbl <- BinomDiffCI(x1 = df[i, 2], n1 = sum(df[2]),
                     x2 = df[i, 3], n2 = sum(df[3]), method = "waldcc")
  tbl[, c("lwr.ci", "upr.ci")]
}
N_A <- cbind(N_A, t(sapply(1:nrow(N_A), \(i) ci_diff(N_A, i))))
response n_T n_C lwr.ci upr.ci
1 Negative 48 36 -0.04342071 0.1982961
2 Neutral 43 40 -0.11268594 0.1293812
3 Positive 42 51 -0.20971246 0.0381418
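For goal 1), the BinomDiffCI return value also includes the point estimate in an est column alongside lwr.ci and upr.ci, so the helper can return all three. A sketch building on the code above (ci_diff3 and add_ci are illustrative names):
library(DescTools)
# Return the estimate plus lower and upper CI of the treatment-control
# difference for row i of a table laid out like N_A (response, n_T, n_C)
ci_diff3 <- function(df, i) {
  tbl <- BinomDiffCI(x1 = df[i, 2], n1 = sum(df[2]),
                     x2 = df[i, 3], n2 = sum(df[3]), method = "waldcc")
  tbl[, c("est", "lwr.ci", "upr.ci")]
}
# Apply to every row and bind the results onto the table; this works for any
# frequency table with the same three-column layout (goal 4)
add_ci <- function(df) {
  cbind(df, t(sapply(seq_len(nrow(df)), \(i) ci_diff3(df, i))))
}
N_A <- add_ci(N_A)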
I am trying to simulate how replacement/reassignment of values in random samples affects the predictions conveyed by the AUC.
I have a tumor classification in a dataframe, denoted df$who, which has levels 1, 2, 3 corresponding to the severity of the tumor lesion.
Intro to the question
Let's say the baseline data look like this:
set.seed(1)
df <- data.frame(
  who = as.factor(sample(1:3, size = 6000, replace = TRUE, prob = c(0.8, 0.15, 0.05))),
  age = round(runif(n = 6000, min = 18, max = 95), digits = 1),
  gender = sample(c("m", "f"), size = 6000, replace = TRUE, prob = c(1/3, 2/3)),
  event.time = runif(n = 6000, min = 8, max = 120),
  event = as.factor(sample(0:2, size = 6000, replace = TRUE, prob = c(0.25, 0.2, 0.55)))
)
And a standard cause-specific Cox regression looks like:
library(survival)
a_baseline <- coxph(Surv(event.time, event == 1) ~ who + age + gender, data = df, x = TRUE)
From this, the AUC can be obtained as a measure of predictive performance; here, via leave-one-out bootstrap for the 5-year prediction of df$event == 1.
library(riskRegression)
u <- Score(list("baseline" = a_baseline),
           Surv(event.time, event == 1) ~ 1,
           data = df,
           times = 60,
           plots = "cal",
           B = 50,
           split.method = "loob",
           metrics = c("auc", "brier"))
# The AUC is then obtained
u$AUC$score$AUC[2]
Question
I want to simulate how re-classifying a random 5% of df$who == 1 to df$who == 2 affects the 5-year prediction of df$event == 1.
I want to create 10 separate simulated copies of the baseline data df, each containing a random reallocation of 5% of df$who == 1 to df$who == 2. Then I want to use each of these 10 simulated datasets to predict the 5-year risk of df$event == 1.
I have tried a for loop for this. The expected output is a dataframe that tells me which of the 10 simulated datasets yielded the highest and lowest u$AUC$score$AUC[2] (i.e., the best and worst prediction).
I am new to for loops, but here is my attempt (which obviously did not work):
all_auc <- data.frame() ## create a dataframe to fill with the AUC from all 10 simulated sub-datasets
for (i in 1:10) { # 1:10 represent the simulated datasets from 1 to 10
  df[i] <- df # allocating baseline data to each of the 10 datasets
  df[i]$who[sample(which(df[i]$who == 1), round(0.05 * length(which(df[i]$who == 1))))] = 2 # randomly reallocate 5% of who == 1 to who == 2 in the i'th simulated dataset
  ith_cox <- coxph(Surv(event.time, event == 1) ~ who + age + gender, data = df[i], x = TRUE) # fit the i'th Cox regression on the i'th dataset
  # create the predictions based on the i'th Cox model
  u[i] <- Score(list("baseline" = ith_cox),
                Surv(event.time, event == 1) ~ 1,
                data = df[i],
                times = 60,
                plots = "cal",
                B = 50,
                split.method = "loob",
                metrics = c("auc", "brier"))
  # summarize all AUC from all 10 sub-datasets
  all_auc <- u[i]$AUC$score$AUC[2]
}
(1) I could not get this for loop to work as described, and
(2) the final dataframe all_auc should report only which of the 10 datasets yielded the worst and best predictions (I will then use those two datasets for further analysis).
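For what it's worth, here is a minimal sketch of how the loop might be restructured so each iteration works on its own copy of df and the AUCs accumulate in a vector. This is untested against the real data; auc_vec and sim_list are illustrative names, and the AUC indexing simply mirrors u$AUC$score$AUC[2] above.
library(survival)
library(riskRegression)

n_sim <- 10
auc_vec <- numeric(n_sim)          # one AUC per simulated dataset
sim_list <- vector("list", n_sim)  # keep the simulated datasets themselves

for (i in seq_len(n_sim)) {
  df_i <- df  # copy the baseline data instead of overwriting df
  idx <- which(df_i$who == 1)
  df_i$who[sample(idx, round(0.05 * length(idx)))] <- "2"  # "2" is an existing factor level
  fit_i <- coxph(Surv(event.time, event == 1) ~ who + age + gender,
                 data = df_i, x = TRUE)
  score_i <- Score(list("sim" = fit_i),
                   Surv(event.time, event == 1) ~ 1,
                   data = df_i, times = 60, plots = "cal",
                   B = 50, split.method = "loob",
                   metrics = c("auc", "brier"))
  auc_vec[i] <- score_i$AUC$score$AUC[2]
  sim_list[[i]] <- df_i
}

which.max(auc_vec)  # index of the best-predicting simulated dataset
which.min(auc_vec)  # index of the worst
With 10,000 datasets, keeping every copy in sim_list may be too memory-hungry; storing only the sampled row indices (or the RNG seed) per iteration and regenerating the best and worst datasets afterwards would be lighter.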
A final note
This is only a reproducible example. The for loop will be applied to 10,000 simulated datasets in our analysis. I do not know whether this affects the answer, but it underlines what matters about the result: a dataframe (or vector?) that simply tells me which simulated datasets yielded the best and worst predictions, so that I can then use those two dataframes (e.g. df2930 and df8939) for further analysis.
I am trying to compare two percentages/proportions for statistical significance in R, using a Chi-Square test. I am familiar with a SAS method for Chi Square in which I supply a dataset column for a numerator, another column for denominator, and a categorical variable to distinguish distributions (A/B).
However, I am getting unexpected values in R on some example sets. When I test two similar populations with low sample sizes, I get p-values of approximately zero, where I would expect them to be very high (~1).
My test set is below, where I went with sugar content in a batch of water: e.g. "does group A use the same ratio of sugar as group B?". My actual problem is similar, where this isn't a pass-fail type test and the numerator and denominator values can vary wildly between samples (different sugar and/or water weights per sample). My first objective is to verify that I can get a high p-value from two similar sets. The next question is, at what sample size does the p-value become low enough to indicate significance?
# CREATE 2 NEARLY-EQUAL DISTRIBUTIONS (EXPECTING HIGH P-VALUE FROM PROP.TEST)
library(dplyr) # provides tibble() and the %>% pipelines used below
set.seed(108)
group_A <- tibble(group = "A", sugar_lbs = rnorm(mean = 10, sd = 3, n = 50), batch_lbs = rnorm(mean = 30, sd = 6, n = 50))
group_B <- tibble(group = "B", sugar_lbs = rnorm(mean = 10, sd = 3, n = 50), batch_lbs = rnorm(mean = 30, sd = 6, n = 50))
batches <- rbind(group_A, group_B)
I then do a summarize to calculate the overall sugar percentage tendency between groups:
# SUMMARY TOTALS
totals <- batches %>%
  group_by(group) %>%
  summarize(batch_count = n(),
            batch_lbs_sum = sum(batch_lbs),
            sugar_lbs_sum = sum(sugar_lbs),
            sugar_percent_overall = sugar_lbs_sum / batch_lbs_sum) %>%
  glimpse()
I then supply the sugar percentage for each group to prop.test, expecting a high p-value:
# ADD P-VALUE & CONFIDENCE INTERVAL
stats <- totals %>%
  rowwise() %>%
  summarize(p_val = prop.test(x = sugar_percent_overall, n = batch_count, conf.level = 0.95, alternative = "two.sided")$p.value) %>%
  mutate(p_val = round(p_val, digits = 3)) %>%
  mutate(conf_level = 1 - p_val) %>%
  select(p_val, conf_level) %>%
  glimpse()
# FINAL SUMMARY TABLE
cbind(totals, stats) %>%
  glimpse()
Unfortunately the final table gives me a p-value of 0, suggesting the two nearly-identical sets are independent/different. Shouldn't I get a p-value of ~1?
Observations: 2
Variables: 7
$ group <chr> "A", "B"
$ batch_count <int> 50, 50
$ batch_lbs_sum <dbl> 1475.579, 1475.547
$ sugar_lbs_sum <dbl> 495.4983, 484.6928
$ sugar_percent_overall <dbl> 0.3357992, 0.3284833
$ p_val <dbl> 0, 0
$ conf_level <dbl> 1, 1
From another angle, I also tried comparing the recommended sample size from power.prop.test with an actual prop.test using that recommended sample size. This gave me the reverse problem: I was expecting a low p-value, since I am using the recommended sample size, but instead got a p-value of ~1.
# COMPARE POWER.PROP.TEST RECOMMENDED COUNTS WITH AN ACTUAL PROP.TEST
power.prop.test(p1 = 0.33, p2 = 0.34, sig.level = 0.10, power = 0.80, alternative = "two.sided") ## n = 38154
prop.test(x = c(0.33, 0.34), n = c(38154, 38154), conf.level = 0.90, alternative = "two.sided") ## p = 1 -- shouldn't p be < 0.10?
Am I using prop.test wrong or am I misinterpreting something? Ideally, I would prefer to skip the summarize step and simply supply the dataframe, the numerator column 'sugar_lbs', and the denominator 'batch_lbs' as I do in SAS -- is this possible in R?
(Apologies for any formatting issues as I'm new to posting)
---------------------------------
EDIT - EXAMPLE WITH ONLY PROPORTIONS & SAMPLE SIZE
I think my choice of using normal distributions may have distracted from the original question. I found an example that gets to the heart of what I was trying to ask, which is how to use prop.test given only a proportion/percentage and the sample size. Instead of city_percent and city_total below, I could simply rename these to sugar_percent and batch_lbs. I think this reference answers my question: prop.test appears to be the correct test to use.
My actual problem has an extremely non-normal distribution, but is not easily replicated via code.
STANFORD EXAMPLE (pages 37-50)
- https://web.stanford.edu/class/psych10/schedule/P10_W7L1
df <- tibble(city = c("Atlanta", "Chicago", "NY", "SF"),
             washed = c(1175, 1329, 1169, 1521),
             not_washed = c(413, 180, 334, 215)) %>%
  mutate(city_total = washed + not_washed,
         city_percent = washed / city_total) %>%
  select(-washed, -not_washed) %>%
  glimpse()
# STANFORD CALCULATION (p = 7.712265e-35)
pchisq(161.74, df = 3, lower.tail = FALSE)
# PROP TEST VERSION (SAME RESULT, p = 7.712265e-35)
prop.test(x = df$city_percent * df$city_total, n = df$city_total, alternative = "two.sided", conf.level = 0.95)$p.value
The documentation for prop.test says:
Usage

prop.test(x, n, p = NULL,
          alternative = c("two.sided", "less", "greater"),
          conf.level = 0.95, correct = TRUE)

Arguments

x    a vector of counts of successes, a one-dimensional table with two
     entries, or a two-dimensional table (or matrix) with 2 columns,
     giving the counts of successes and failures, respectively.

n    a vector of counts of trials; ignored if x is a matrix or a table.
So if you want a "correct" test, you would have to use sugar_lbs_sum as the x instead of sugar_percent_overall. You should still receive some kind of warning that the x is non-integral, but that's not my major concern.
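In code, that suggestion would look something like the sketch below, using the totals table from the question (prop.test may complain because these "counts" are not integers). The same counts-not-proportions point applies to the power.prop.test comparison in the question:
# Counts of "successes" (sugar lbs) out of "trials" (batch lbs)
prop.test(x = totals$sugar_lbs_sum, n = totals$batch_lbs_sum,
          conf.level = 0.95, alternative = "two.sided")
# For the power.prop.test follow-up: pass counts, not proportions
prop.test(x = round(c(0.33, 0.34) * 38154), n = c(38154, 38154),
          conf.level = 0.90, alternative = "two.sided")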
But from a statistical perspective this is the completely wrong way of doing things. You are directly inducing spurious correlation in a test of the difference between two quantities by arbitrarily dividing each by a sum. If the samples (sugar_lbs_sum) are independent but you divide by their sums, you have made the ratios dependent. This violates the assumptions of the statistical test in a critical way. Kronmal (1993), "Spurious correlation and the fallacy of the ratio standard revisited", covers this.
The data you generated are independent normal, so don't sum them, rather test for a difference with the t-test.
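In code, that might look like the following, using the batches tibble created in the question:
# Two-sample t-test for a difference in mean sugar weight between groups
t.test(sugar_lbs ~ group, data = batches)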
The Stanford link I added to my original post answered my question. I modified the Stanford example to simply rename the variables from city to group and the washed counts to sugar_lbs. I also doubled one batch (like comparing a small versus a large city). I now get the expected high p-value (0.65), indicating no statistically significant difference between the proportions.
When I add more groups (for more degrees of freedom) and continue to vary batch sizes proportionally, I continue to get high p-values as expected, confirming the recipe is the same. If I modify the sugar percent of any one group, the p-value immediately drops to zero, indicating one of the groups is different, as expected.
Finally, when doing the prop.test within a dplyr pipe, I found I should not have used the rowwise() step, which caused my p-values to fall to zero. Removing this step gives the correct p-value. The only downside is that I don't know which group is different until I compare two groups at a time iteratively (see the note after the example below).
#---------------------------------------------------------
# STANFORD EXAMPLE - MODIFIED TO SUGAR & ONE DOUBLE BATCH
#---------------------------------------------------------
df <- tibble(group = c("A", "B"),
             sugar_lbs = c(495.5, 484.7),
             water_lbs = c(1475.6 - 495.5, 1475.6 - 484.7)) %>%
  mutate(sugar_lbs = ifelse(group == "B", sugar_lbs * 2, sugar_lbs),
         water_lbs = ifelse(group == "B", water_lbs * 2, water_lbs)) %>%
  mutate(batch_lbs = sugar_lbs + water_lbs,
         sugar_percent = sugar_lbs / batch_lbs) %>%
  glimpse()
sugar_ratio_all <- sum(df$sugar_lbs) / (sum(df$sugar_lbs) + sum(df$water_lbs))
water_ratio_all <- sum(df$water_lbs) / (sum(df$sugar_lbs) + sum(df$water_lbs))
dof <- (2 - 1) * (length(df$group) - 1)
df <- df %>%
  mutate(sugar_expected = (sugar_lbs + water_lbs) * sugar_ratio_all,
         water_expected = (sugar_lbs + water_lbs) * water_ratio_all) %>%
  mutate(sugar_chi_sq = (sugar_lbs - sugar_expected)^2 / sugar_expected,
         water_chi_sq = (water_lbs - water_expected)^2 / water_expected) %>%
  glimpse()
q <- sum(df$sugar_chi_sq) + sum(df$water_chi_sq)
# STANFORD CALCULATION
pchisq(q, df = dof, lower.tail = FALSE)
# PROP TEST VERSION (SAME RESULT)
prop.test(x = df$sugar_percent * df$batch_lbs, n = df$batch_lbs, alternative = "two.sided", conf.level = 0.95)$p.value
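On the last point above, about not knowing which group differs without iterating by hand: base R's pairwise.prop.test() runs all pairwise comparisons with a multiplicity adjustment. A sketch using the washed/not-washed counts from the Stanford example:
washed <- c(1175, 1329, 1169, 1521)            # successes per city
city_totals <- washed + c(413, 180, 334, 215)  # trials per city
pairwise.prop.test(x = washed, n = city_totals, p.adjust.method = "holm")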
This function needs to take a data frame with three variables and four observations (a, b, c and d) and calculate (a/c) / (b/d).
for example:
df <- data.frame(female = c("White", "White", "non-White", "non-White"),
                 male = c("White", "non-White", "White", "non-White"),
                 n = c(85, 5, 5, 10))
xtabs(n ~ female + male, df)
the function would have to calculate (85 * 10) / (5 * 5) and return a result of 34.
I have previously tried this:
oddsRatio <- function(x) {
  x %>%
    summarise(oddsratio = (n[1] * n[4]) / (n[2] * n[3]))
}
oddsRatio(df)
but this produced the answer in a table and also didn't work universally on other data frames with 4 observations and 3 variables in the way that I wanted it to.
A dplyr way of doing it
df %>%
  summarize(oddsRatio = prod(n[female == male]) / prod(n[female != male]))
As a function that returns a number
oddsRatio <- function(x) {
  x %>%
    summarize(oddsRatio = prod(n[female == male]) / prod(n[female != male])) %>%
    pull(oddsRatio)
}
oddsRatio(df)
# 34
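If the factor columns are named differently in other data frames, a positional version may travel better. An illustrative sketch, assuming the same layout of two classification columns followed by a count column:
oddsRatio <- function(x) {
  same <- x[[1]] == x[[2]]                  # concordant (diagonal) cells
  prod(x[[3]][same]) / prod(x[[3]][!same])  # (a * d) / (b * c)
}
oddsRatio(df)
# 34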
Here are some ways. No packages are needed.
1) Create a logical matrix that picks out the diagonal, then subset xt with it and with its negation.
xt <- xtabs(n ~ female + male, df)
is.diag <- row(xt) == col(xt)
prod(xt[is.diag]) / prod(xt[!is.diag])
## [1] 34
2) or pick out the diagonal and antidiagonal using indexes:
prod(xt[c(1, 4)]) / prod(xt[2:3])
## [1] 34
3) If the values of xt are known to be strictly positive then we could take the log, multiply that by c(1, -1, -1, 1), sum and take exp to get back:
exp(sum(log(xt) * c(1, -1, -1, 1)))
## [1] 34
4) If you are performing this calculation to test independence of the factors, you could just use fisher.test directly. Fisher's exact test calculates the maximum likelihood estimate of the odds ratio given the table's marginals using the hypergeometric distribution. The null hypothesis is that the two factors are independent, i.e. that the odds ratio equals 1; in the example below it is rejected, i.e. the factors are not independent. Note that the confidence interval does not contain 1.
fisher.test(xt)
giving:
Fisher's Exact Test for Count Data
data: xt
p-value = 2.435e-07
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
6.951076 174.962113
sample estimates:
odds ratio
31.48572
with(df, prod(ifelse(female==male, n, 1/n)))
[1] 34
This is similar to @cnicol's solution but avoids any need for the tidyverse.
The question is given like this:
Read the file diabetes.csv. There are two variables called BMI and Outcome. The variable Outcome takes on only two values: 0 and 1. Conduct a non-parametric two-sample test of the hypothesis that the standard deviation of BMI is the same for both Outcome values.
bmi <- diabetes$BMI
bmi
outcome <- diabetes$Outcome
outcome
n <- length(bmi)
# tstat
tstat <- ???
# Describe the population and draw synthetic samples
f1 <- function() {
  x <- c(bmi, outcome)
  x <- sample(x)
  m1 <- sd(x[1:n])
  m2 <- sd(x[(n+1):length(x)])
  return(m1 - m2)
}
# Create sampling distribution
sdist <- replicate(10000, f1())
plot(density(sdist))
# Gap
gap <- abs(mean(sdist) - tstat)
abline(v = mean(sdist) + c(-1,1) * gap, col = "dark orange")
s1 <- sdist[sdist < (mean(sdist) - gap) | sdist > (mean(sdist) + gap)]
pvalue <- length(s1) / length(sdist)
pvalue
The data is in some dataset called "diabetes". My question is how to represent the "t-statistic" since the outcome is binary?
Use this code:
# Sort the table diabetes in ascending order of Outcome to separate the BMI
# values with outcome = 0 from the BMI values with outcome = 1
diabetes = diabetes[order(diabetes$Outcome),]
View(diabetes)
# Find the number of values with outcome = 0
n = length(which(diabetes$Outcome == 0))
# Find total number of rows
l = length(diabetes$BMI)
# Find BMI values to create the sample later on
g = diabetes$BMI
# Create function to take the values of BMI and shuffle it every time and
# to find the difference between the standard deviations
f1 = function() {
  x = sample(g)
  z = abs(sd(x[1:n]) - sd(x[(n+1):l]))
  return(z)
}
# Replicate the function several times
dist = replicate(100000,f1())
# Plot density of distribution
plot(density(dist))
polygon(density(dist),col="green")
diabetes0 = diabetes[diabetes$Outcome == 0,]
diabetes1 = diabetes[diabetes$Outcome == 1,]
View(diabetes0)
View(diabetes1)
# Find the difference between standard deviation of BMI when outcome = 0 and
# when outcome = 1
tstat = abs(sd(diabetes0$BMI) - sd(diabetes1$BMI))
tstat
abline(v=tstat)
rside = dist[dist>tstat]
pvalue = length(rside)/length(dist)
pvalue
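For reference, the observed statistic can also be computed directly from the unsorted data; a compact equivalent of the tstat lines above:
tstat = with(diabetes, abs(sd(BMI[Outcome == 0]) - sd(BMI[Outcome == 1])))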
I use a VAR(1) model with two variables (f, m), each with 59 observations.
I have already consulted the R help and several books on this topic, but I can't figure out how df2 = 108.
library(vars)
var.causal.m <- causality(ajustVAR1FM, cause = "m")
> var.causal.m
$Granger
Granger causality H0: m do not Granger-cause f
data: VAR object ajustVAR1FM
F-Test = 5.9262, df1 = 1, df2 = 108, p-value = 0.01656
If you look at the package manual, it is clearly written that the test statistic is distributed as F(p * K1 * K2, K * T - n*), where K = K1 + K2 and n* is the total number of parameters in the above VAR(p) (including deterministic regressors). Further, for the test, the vector of endogenous variables y_t is split into two subvectors y_1t and y_2t with dimensions (K1 x 1) and (K2 x 1), so that K = K1 + K2.
You can also type causality in the console and see the following:
df1 <- p * length(y1.names) * length(y2.names)
df2 <- K * obs - length(PI)
Example: using Canada data
library(vars)
var.2c <- VAR(Canada, p = 2, type = "const")
> dim(Canada)
[1] 84 4
> causality(var.2c, cause = "e")
$Granger
Granger causality H0: e do not Granger-cause prod rw U
data: VAR object var.2c
F-Test = 6.2768, df1 = 6, df2 = 292, p-value = 3.206e-06
The cause variable is e, so K1 = 1 and K2 = 3 (4 - 1, where 4 is the total number of variables). T is the effective number of observations, here 84 - 2 (lag = 2) = 82, and n* = 36 (4 equations with 9 parameters each). So df1 = 2 * 1 * 3 = 6 and df2 = 4 * 82 - 36 = 292.
Note:
In your case the lag is p = 1 and n* = 8 (you estimate two equations with 4 parameters each; I suspect you also have a trend, so it should be 4). The effective number of observations is 59 - 1 (lag p = 1) = 58, with K1 = 1, K2 = 1, and K = 2. So df1 = 1 * 1 * 1 = 1 and df2 = 2 * 58 - 8 = 108.
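A quick arithmetic check of both examples in plain R, using the quantities stated above:
# Canada example: p = 2, K1 = 1, K2 = 3, K = 4, T = 82, n* = 36
2 * 1 * 3    # df1 = 6
4 * 82 - 36  # df2 = 292
# The VAR(1) in the question: p = 1, K1 = K2 = 1, K = 2, T = 58, n* = 8
1 * 1 * 1    # df1 = 1
2 * 58 - 8   # df2 = 108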