This function needs to take a data frame with three variables and four observations (a, b, c and d) and calculate (a/c) / (b/d).
for example:
df <- data.frame(female = c("White", "White", "non-White", "non-White"),
male = c("White", "non-White", "White", "non-White"),
n = c(85, 5, 5, 10))
xtabs(n ~ female + male, df)
the function would have to calculate (85 * 10) / (5 * 5) and return a result of 34.
I have previously tried this:
oddsRatio <- function(x){
x %>%
summarise(oddsratio = (n[1] * n[4]) / (n[2] * n[3]))
}
oddsRatio(df)
but this produced the answer in a table and also didn't work universally on other data frames with 4 observations and 3 variables in the way that I wanted it to.
A dplyr way of doing it
df %>%
summarize(oddsRatio = prod(n[female == male])/prod(n[female != male]))
As a function that returns a number
oddsRatio <- function(x) {
x %>%
summarize(oddsRatio = prod(n[female == male])/prod(n[female != male])) %>%
pull(oddsRatio)
}
oddsRatio(df)
# 34
Here are some ways. No packages are needed.
1) Create a logical that picks out the diagonals and then subset xt with that and with its negative.
xt <- xtabs(n ~ female + male, df)
is.diag <- row(xt) == col(xt)
prod(xt[is.diag]) / prod(xt[!is.diag])
## [1] 34
2) or pick out the diagonal and antidiagonal using indexes:
prod(xt[c(1, 4)]) / prod(xt[2:3])
## [1] 34
3) If the values of xt are known to be strictly positive then we could take the log, multiply that by c(1, -1, -1, 1), sum and take exp to get back:
exp(sum(log(xt) * c(1, -1, -1, 1)))
## [1] 34
4) If you are performing this calculation to test independence of the factors you could just directly use fisher.test . Fisher's exact test calculates the maximum likelihood estimate of the odds ratio given the table's marginals using the hypergeometric distribution. The null hypothesis is that the two factors are independent, i.e. the odds ratio equals 1, and in the example below it is rejected, i.e. the factors are not independent. Note that the confidence interval does not contain 1.
fisher.test(xt)
giving:
Fisher's Exact Test for Count Data
data: xt
p-value = 2.435e-07
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
6.951076 174.962113
sample estimates:
odds ratio
31.48572
with(df, prod(ifelse(female==male, n, 1/n)))
[1] 34
This is similar to #cnicol's solution but aviods any need for tidyverse.
Related
I have a table with frequencies for control and treatment group for a multinomial factor (`response'), with three levels (Negative, Neutral, Positive). I want to calculate for each levelthe difference between treatment and control, and confidence intervals, and add them to the table.
I am looking for something that can be applied to several similar frequency tables that compare treatment and control groups, where the response categories vary (e.g. unlikely, 50-50, likely).
Here is the table:
N_A <- data.frame (response = c("Negative", "Neutral", "Positive"),
n_T = c(48, 43, 42), # treatment group
n_C = c(36, 40, 51) # control group
)
I have tried to use the BinomDiffCI function from the DescTools package. I managed to write a function that runs BinomDiffCI for the first row, and extracts the lower CI.
library(DescTools)
lci.diff <- function(){
xci <- BinomDiffCI(x1 = N_A[1,2], n1 = sum(N_A[2]), x2 = N_A[1,3], n2 = sum(N_A[3]), method=c("waldcc"))
xci[,2]
}
It's not great, but maybe a start. I want to 1) add difference and upper CI, 2) do the same for all rows, 3) attach this to the dataset, and 4) apply the same to other frequency tables comparing treatment and control.
Here is the code to create the lower and upper bounds of the confidence interval
library(DescTools)
ci_diff <- function(df, i) {
tbl <- BinomDiffCI(x1 = df[i,2], n1 = sum(df[2]), x2 = df[i,3], n2 = sum(df[3]), method=c("waldcc"))
tbl[ , c("lwr.ci", "upr.ci")]
}
N_A <- cbind(N_A, t(sapply(1:nrow(N_A), \(i) ci_diff(N_A, i)))
response n_T n_C lwr.ci upr.ci
1 Negative 48 36 -0.04342071 0.1982961
2 Neutral 43 40 -0.11268594 0.1293812
3 Positive 42 51 -0.20971246 0.0381418
I am trying to do Fisher's exact test for combinations of an n x 2 dataframe and from what I have read, pairwise fishers seems to be what I want to use (see here). However, in doing so it produced p-value results that didn't look right, so I decided to manually check on combinations and got different results. I've included what I hope is a reproducible example to highlight what I've tried. Perhaps I'm doing something wrong with the R code, as I'm still relatively inexperienced, or I may be completely misunderstanding what the pairwise tests are meant to compute - if so, sorry and I can remove the question if it's not appropriate for SO.
# Packages -----------------------------------------------------------
library("tidyverse")
library("janitor")
library("RVAideMemoire")
library("fmsb")
# Generate Data -----------------------------------------------------------
set.seed(1)
test <-
tibble(
"drug" = sample(
c("Control", "Treatment1", "Treatment2"),
size = 300,
prob = c(0.1, 0.4, 0.3),
replace = TRUE),
"country" = sample(
c("Canada", "United States"),
size = 300,
prob = c(0.4, 0.6),
replace = TRUE
),
"selected" = sample(
c(0, 1),
size = 300,
prob = c(0.1, 0.65),
replace = TRUE)
)
test2 <- test %>%
filter(selected == 1)
test2_tab <- test2 %>%
tabyl(drug, country) %>%
remove_rownames() %>%
column_to_rownames(var = colnames(.[1])) %>%
as.matrix()
When I run the following pairwise tests I get this as the output (I used 2 packages just to make sure it wasn't that I just implemented one incorrectly).
# Pairwise ----------------------------------------------------------------
RVAideMemoire::fisher.multcomp(test2_tab, p.method = "bonferroni")
fmsb::pairwise.fisher.test(test2_tab, p.adjust.method = "bonferroni")
Pairwise comparisons using Fisher's exact test for count data
data: test2_tab
Control Treatment1
Treatment1 1 -
Treatment2 1 1
P value adjustment method: bonferroni
Pairwise comparisons using Pairwise comparison of proportions (Fisher)
data: test2_tab
Control Treatment1
Treatment1 1 -
Treatment2 1 1
P value adjustment method: bonferroni
However, when I create the individual tables to perform individual Fisher's test, like below, I get different results.
# Individual --------------------------------------------------------------
drug.groups2 <- unique(test2$drug)
# Just to check the correct 2x2 tables are produced
# combn(drug.groups2, 2, function(x) {
# id <- test2$drug %in% x
# cross_tabs <- table(test2$drug[id], test2$country[id])
# }, simplify = FALSE)
combn(drug.groups2, 2, function(x) {
id <- test2$drug %in% x
cross_tabs <- table(test2$drug[id], test2$country[id])
fishers <- fisher.test(cross_tabs)
fishers$data.name <-
paste(
unique(
as.character(test2$drug[id])
),collapse="-")
return(fishers)
}, simplify = FALSE)
[[1]]
Fisher's Exact Test for Count Data
data: Treatment1-Treatment2
p-value = 0.3357
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
0.7566901 2.4175206
sample estimates:
odds ratio
1.347105
[[2]]
Fisher's Exact Test for Count Data
data: Treatment1-Control
p-value = 0.4109
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
0.2560196 1.6292583
sample estimates:
odds ratio
0.6637235
[[3]]
Fisher's Exact Test for Count Data
data: Treatment2-Control
p-value = 1
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
0.3294278 2.3146386
sample estimates:
odds ratio
0.8940101
Isn't it due to Bonferroni correction which is applied to pairwise comparisons while is not applied to individual tests?
As clearly pointed out in the comments by Lukasz and StupidWolf, I had forgotten that I had applied the p.method = "bonferroni" correction, and the results are the same with the function call p.method = "none" ...
I am trying to compare two percentages/proportions for statistical significance in R, using a Chi-Square test. I am familiar with a SAS method for Chi Square in which I supply a dataset column for a numerator, another column for denominator, and a categorical variable to distinguish distributions (A/B).
However I am getting unexpected values in R using some examples sets. When I test two similar populations, with low sample sizes, I am getting p-values of (approximately) zero, where I would expect the p-values to be very high (~ 1).
My test set is below, where I went with sugar content in a batch of water: e.g. "does group A use the same ratio of sugar as group B?". My actual problem is similar, where this isn't a pass-fail type test and the numerator and denominator values can vary wildly between samples (different sugar and/or water weights per sample). My first objective is to verify that I can get a high p-value from two similar sets. The next question is, at what sample size does the p-value become low enough to indicate significance?
# CREATE 2 NEARLY-EQUAL DISTRIBUTIONS (EXPECTING HIGH P-VALUE FROM PROP.TEST)
set.seed(108)
group_A = tibble(group = "A", sugar_lbs = rnorm(mean = 10, sd = 3, n = 50), batch_lbs = rnorm(mean = 30, sd = 6, n = 50))
group_B = tibble(group = "B", sugar_lbs = rnorm(mean = 10, sd = 3, n = 50), batch_lbs = rnorm(mean = 30, sd = 6, n = 50))
batches <- rbind(group_A, group_B)
I then do a summarize to calculate the overall sugar percentage tendency between groups:
# SUMMARY TOTALS
totals <- batches %>%
group_by(group) %>%
summarize(batch_count = n(),
batch_lbs_sum = sum(batch_lbs),
sugar_lbs_sum = sum(sugar_lbs),
sugar_percent_overall = sugar_lbs_sum / batch_lbs_sum) %>%
glimpse()
I then supply the sugar percentage between groups to a prop.test, expecting a high p-value
# ADD P-VALUE & CONFIDENCE INTERVAL
stats <- totals %>%
rowwise() %>%
summarize(p_val = prop.test(x = sugar_percent_overall, n = batch_count, conf.level = 0.95, alternative = "two.sided")$p.value) %>%
mutate(p_val = round(p_val, digits = 3)) %>%
mutate(conf_level = 1 - p_val) %>%
select(p_val, conf_level) %>%
glimpse()
# FINAL SUMMARY TABLE
cbind(totals, stats) %>%
glimpse()
Unforunately the final table gives me a p-value of 0, suggesting the two nearly-identical sets are independent/different. Shouldn't I get a p-value of ~1?
Observations: 2
Variables: 7
$ group <chr> "A", "B"
$ batch_count <int> 50, 50
$ batch_lbs_sum <dbl> 1475.579, 1475.547
$ sugar_lbs_sum <dbl> 495.4983, 484.6928
$ sugar_percent_overall <dbl> 0.3357992, 0.3284833
$ p_val <dbl> 0, 0
$ conf_level <dbl> 1, 1
From another angle, I also tried to compare the recommended sample size from power.prop.test with an actual prop.test using this recommended sample size. This gave me the reverse problem -- I was a expecting low p-value, since I am using the recommended sample size, but instead get a p-value of ~1.
# COMPARE PROP.TEST NEEDED COUNTS WITH AN ACTUAL PROP.TEXT
power.prop.test(p1 = 0.33, p2 = 0.34, sig.level = 0.10, power = 0.80, alternative = "two.sided") ## n = 38154
prop.test(x = c(0.33, 0.34), n = c(38154, 38154), conf.level = 0.90, alternative = "two.sided") ## p = 1 -- shouldn't p be < 0.10?
Am I using prop.test wrong or am I misinterpreting something? Ideally, I would prefer to skip the summarize step and simply supply the dataframe, the numerator column 'sugar_lbs', and the denominator 'batch_lbs' as I do in SAS -- is this possible in R?
(Apologies for any formatting issues as I'm new to posting)
---------------------------------
EDIT - EXAMPLE WITH ONLY PROPORTIONS & SAMPLE SIZE
I think my choice of using normal distributions may have distracted from the original question. I found an example that gets to the heart of what I was trying to ask, which is how to use prop test given only a proportion/percentage and the sample size. Instead of city_percent and city_total below, I could simply rename these to sugar_percent and batch_lbs. I think this reference answers my question, where prop.test appears to be the correct test to use.
My actual problem has an extremely non-normal distribution, but is not easily replicated via code.
STANFORD EXAMPLE (pages 37-50)
- https://web.stanford.edu/class/psych10/schedule/P10_W7L1
df <- tibble(city = c("Atlanta", "Chicago", "NY", "SF"), washed = c(1175, 1329, 1169, 1521), not_washed = c(413, 180, 334, 215)) %>%
mutate(city_total = washed + not_washed,
city_percent = washed / city_total) %>%
select(-washed, -not_washed) %>%
glimpse()
# STANFORD CALCULATION (p = 7.712265e-35)
pchisq(161.74, df = 3, lower.tail = FALSE)
# PROP TEST VERSION (SAME RESULT, p = 7.712265e-35)
prop.test(x = df$city_percent * df$city_total, n = df$city_total, alternative = "two.sided", conf.level = 0.95)$p.value
The documentation for prop.test says:
Usage prop.test(x, n, p = NULL,
alternative = c("two.sided", "less", "greater"),
conf.level = 0.95, correct = TRUE)
Arguments
x a vector of counts of successes, a one-dimensional table with two entries, or a
two-dimensional table (or matrix) with 2 columns, giving the counts of
successes and failures, respectively.
n a vector of counts of trials; ignored if x is a matrix or a table.
So if you want a "correct" test, you would have to use sugar_lbs_sum as the x instead of sugar_percent_overall. You should still receive some kind of warning that the x is non-integral, but that's not my major concern.
But from a statistical perspective this is the complete wrong way of doing things. You are directly causing spurious correlation for a testing of difference between two quantities by dividing by their sum arbitrarily. If the samples (sugar_lbs_sum) are independent, but you divide by their sums, you have made the ratios dependent. This violates the assumptions of the statistical test in a critical way. Kronmal 1993 "Spurious correlation and the fallacy of the ratio" covers this.
The data you generated are independent normal, so don't sum them, rather test for a difference with the t-test.
The Stanford link I added to my original post answered my question. I modified the Stanford example to simply rename the variables from city to group, and washed counts to sugar_lbs. I also doubled one batch, (or comparing a small versus large city). I now get the expected high p-value (0.65) indicating that there is no statistical significance that the proportions are different.
When I add more groups (for more degrees of freedom) and continue to vary batch sizes proportionally, I continue to get high p-values as expected, confirming the recipe is the same. If I modify the sugar percent of any one group, the p-value immediately drops to zero indicating one of the groups is different, as expected.
Finally, when doing the prop.text within a 'dplyr' pipe, I found I should not have used the rowwise() step, which causes my p-values to fall to zero. Removing this step gives the correct p-value. The only downside is that I don't yet know which group is different until I compare only 2 groups at a time iteratively.
#---------------------------------------------------------
# STANFORD EXAMPLE - MODIFIED TO SUGAR & ONE DOUBLE BATCHED
#--------------------------------------------------------
df <- tibble(group = c("A", "B"), sugar_lbs = c(495.5, 484.7), water_lbs = c(1475.6 - 495.5, 1475.6 - 484.7)) %>%
mutate(sugar_lbs = ifelse(group == "B", sugar_lbs * 2, sugar_lbs),
water_lbs = ifelse(group == "B", water_lbs * 2, water_lbs)) %>%
mutate(batch_lbs = sugar_lbs + water_lbs,
sugar_percent = sugar_lbs / batch_lbs) %>%
glimpse()
sugar_ratio_all <- sum(df$sugar_lbs) / (sum(df$sugar_lbs) + sum(df$water_lbs))
water_ratio_all <- sum(df$water_lbs) / (sum(df$sugar_lbs) + sum(df$water_lbs))
dof <- (2 - 1) * (length(df$group) - 1)
df <- df %>%
mutate(sugar_expected = (sugar_lbs + water_lbs) * sugar_ratio_all,
water_expected = (sugar_lbs + water_lbs) * water_ratio_all) %>%
mutate(sugar_chi_sq = (sugar_lbs - sugar_expected)^2 / sugar_expected,
water_chi_sq = (water_lbs - water_expected)^2 / water_expected) %>%
glimpse()
q <- sum(df$sugar_chi_sq) + sum(df$water_chi_sq)
# STANFORD CALCULATION
pchisq(q, df = dof, lower.tail = F)
# PROP TEST VERSION (SAME RESULT)
prop.test(x = df$sugar_percent * df$batch_lbs, n = df$batch_lbs, alternative = "two.sided", conf.level = 0.95)$p.value
I constructed a binary logistic model. The response variable is binary. There are 4 regressors - 2 binary and 2 integers. I want to find the outliers and delete them. For this i have create some plots:
par(mfrow = c(2,2))
plot(hat.ep,rstudent.ep,col="#E69F00", main="hat-values versus studentized residuals",
xlab="Hat value", ylab="Studentized residual")
dffits.ep <- dffits(model_logit)
plot(id,dffits.ep,type="l", col="#E69F00", main="Index Plot",
xlab="Identification", ylab="Diffits")
cov.ep <- covratio(model_logit)
plot(id,cov.ep,type="l",col="#E69F00", main="Covariance Ratio",
xlab="Identification", ylab="Covariance Ratio")
cook.ep <- cooks.distance(model_logit)
plot(id,cook.ep,type="l",col="#E69F00", main="Cook's Distance",
xlab="Identification", ylab="Cook's Distance")
According to the plots there is an outlier. How can I identify which observation is the outlier?
I have tried :
> outlierTest(model_logit)
No Studentized residuals with Bonferonni p < 0.05
Largest |rstudent|:
rstudent unadjusted p-value Bonferonni p
1061 1.931043 0.053478 NA
Are there some other functions for outlier detection?
Well this answer comes quite late. I'm unsure if you have found the answer or not. Continuing further, in the absence of a minimum reproducible example, I'll attempt to answer the question using some dummy data and two custom functions. For a given continuous variable, outliers are those observations that lie outside of 1.5*IQR, where IQR, the ‘Inter Quartile Range’ is the difference between the 75th and 25th quartiles. I also recommend you to see this post containing far better solutions than my crude answer.
> df <- data.frame(X = c(NA, rnorm(1000), runif(20, -20, 20)), Y = c(runif(1000),rnorm(20, 2), NA), Z = c(rnorm(1000, 1), NA, runif(20)))
> head(df)
X Y Z
1 NA 0.8651 0.2784
2 -0.06838 0.4700 2.0483
3 -0.18734 0.9887 1.8353
4 -0.05015 0.7731 2.4464
5 0.25010 0.9941 1.3979
6 -0.26664 0.6778 1.1277
> boxplot(df$Y) # notice the outliers above the top whisker
Now, I'll create a custom function to detect the outliers and the other function will replace the outlier values with NA.
# this function will return the indices of the outlier values
> findOutlier <- function(data, cutoff = 3) {
## Calculate the sd
sds <- apply(data, 2, sd, na.rm = TRUE)
## Identify the cells with value greater than cutoff * sd (column wise)
result <- mapply(function(d, s) {
which(d > cutoff * s)
}, data, sds)
result
}
# check for outliers
> outliers <- findOutlier(df)
# custom function to remove outliers
> removeOutlier <- function(data, outliers) {
result <- mapply(function(d, o) {
res <- d
res[o] <- NA
return(res)
}, data, outliers)
return(as.data.frame(result))
}
> filterData<- removeOutlier(df, outliers)
> boxplot(filterData$Y)
The question is given like this:
Read the file diabetes.csv. There are two variables called BMI and Outcome. The variable Outcome takes on only two values: 0 and 1. Conduct a non-parametric two sample test for the hypothesis that the standard deviation of BMI is the same for both Outcome values
bmi <- diabetes$BMI
bmi
outcome <- diabetes$Outcome
outcome
n <- length(bmi)
# tstat
tstat <- ???
# Describe the population and draw synthetic samples
f1 <- function()
{
x <- c(bmi, outcome)
x <- sample(x)
m1 <- sd(x[1:n])
m2 <- sd(x[(n+1):length(x)])
return(m1 - m2)
}
# Create sampling distribution
sdist <- replicate(10000, f1())
plot(density(sdist))
# Gap
gap <- abs(mean(sdist) - tstat)
abline(v = mean(sdist) + c(-1,1) * gap, col = "dark orange")
s1 <- sdist[sdist <(mean(sdist - gap)) | sdist >(mean(sdist + gap))]
pvalue <- length(s1) / length(sdist)
pvalue
The data is in some dataset called "diabetes". My question is how to represent the "t-statistic" since the outcome is binary?
Use this code:
# Sort the table diabetes on accending order of Outcome to separate the BMI
# values with outcome = 0 and BMI values with outcome = 1
diabetes = diabetes[order(diabetes$Outcome),]
View(diabetes)
# Find the number of values with outcome = 0
n = length(which(diabetes$Outcome == 0))
# Find total number of rows
l = length(diabetes$BMI)
# Find BMI values to create the sample later on
g = diabetes$BMI
# Create function to take the values of BMI and shuffle it every time and
# to find the difference between the standard deviations
f1 = function()
{
x = sample(g)
z = abs(sd(x[1:n]) - sd(x[(n+1):l]))
return(z)
}
# Replicate the function several times
dist = replicate(100000,f1())
# Plot density of distribution
plot(density(dist))
polygon(density(dist),col="green")
diabetes0 = diabetes[diabetes$Outcome == 0,]
diabetes1 = diabetes[diabetes$Outcome == 1,]
View(diabetes0)
View(diabetes1)
# Find the difference between standard deviation of BMI when outcome = 0 and
# when outcome = 1
tstat = abs(sd(diabetes0$BMI) - sd(diabetes1$BMI))
tstat
abline(v=tstat)
rside = dist[dist>tstat]
pvalue = length(rside)/length(dist)
pvalue