Perform multiple chi-squared test on dataframe based on column value - r

I have a data frame with counts and I want to perform a chisq.test for each value of the variable Cluster. So basically, I need 4 contingency tables (for "A","B","C","D") where rows = Category, columns = Drug, value = Total. Subsequently, a chisq.test should be run on all 4 tables.
Example data frame
df <- data.frame(Cluster = c(rep("A",8),rep("B",8),rep("C",8),rep("D",8)),
Category = rep(c(rep("0-1",2),rep("2-4",2),rep("5-12",2),rep(">12",2)),4),
Drug = rep(c("drug X","drug Y"),16),
Total = as.numeric(sample(20:200,32,replace=TRUE)))

Firstly, use xtabs() to produce stratified contingency tables.
tab <- xtabs(Total ~ Category + Drug + Cluster, df)
tab
# , , Cluster = A
#
# Drug
# Category drug X drug Y
# >12 92 75
# 0-1 33 146
# 2-4 193 95
# 5-12 76 195
#
# etc.
Then use apply() to conduct a Pearson's Chi-squared test over each stratum.
apply(tab, 3, chisq.test)
# $A
#
# Pearson's Chi-squared test
#
# data: array(newX[, i], d.call, dn.call)
# X-squared = 145.98, df = 3, p-value < 2.2e-16
#
# etc.
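If a compact table is more convenient than the printed list, the per-cluster results can be collected into a data frame. The following is only a sketch: the seed value and the name summary_df are assumptions made for illustration, and the data are regenerated in the same shape as the example above.

```r
# Rebuild example data in the same shape as the question (seed is an arbitrary choice)
set.seed(123)
df <- data.frame(
  Cluster  = rep(c("A", "B", "C", "D"), each = 8),
  Category = rep(rep(c("0-1", "2-4", "5-12", ">12"), each = 2), times = 4),
  Drug     = rep(c("drug X", "drug Y"), times = 16),
  Total    = sample(20:200, 32, replace = TRUE)
)

# One Category x Drug table per Cluster, then one test per table
tab   <- xtabs(Total ~ Category + Drug + Cluster, df)
tests <- apply(tab, 3, chisq.test)

# Pull the three headline numbers out of each htest object
summary_df <- data.frame(
  Cluster   = names(tests),
  statistic = sapply(tests, function(t) unname(t$statistic)),
  df        = sapply(tests, function(t) unname(t$parameter)),
  p.value   = sapply(tests, function(t) t$p.value),
  row.names = NULL
)
summary_df
```

Each row corresponds to one stratum, so the table can be sorted or filtered like any other data frame.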
Furthermore, you can perform a Cochran-Mantel-Haenszel chi-squared test for conditional independence.
mantelhaen.test(tab)
# Cochran-Mantel-Haenszel test
#
# data: tab
# Cochran-Mantel-Haenszel M^2 = 59.587, df = 3, p-value = 7.204e-13

Related

How to use apply() on an array of matrices in R?

I am trying to use apply() on an array of matrices.
Here is an example:
data(UCBAdmissions)
fisher.test(UCBAdmissions[,,1]) #This works great
apply(UCBAdmissions, c(1,2,3), fisher.test) #This fails
The UCBAdmissions data contains six 2 x 2 contingency tables, one for each level of the Dept dimension: "A", "B", "C", "D", "E", and "F".
dimnames(UCBAdmissions)
#$Admit
#[1] "Admitted" "Rejected"
#$Gender
#[1] "Male" "Female"
#$Dept
#[1] "A" "B" "C" "D" "E" "F"
You can apply fisher.test to each of these six tables. It is not clear to me from your code apply(UCBAdmissions, c(1,2,3), fisher.test) which of the six tables you want to test.
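To see why the call with c(1,2,3) cannot work, it may help to look at what apply() hands to the function for each choice of MARGIN. This is a small sketch using only base R; the variable names are made up for illustration.

```r
data(UCBAdmissions)

# MARGIN = 3: each call receives one full Admit x Gender table (2 x 2)
slice_dims <- apply(UCBAdmissions, 3, dim)
slice_dims

# MARGIN = c(1, 2, 3): each call receives a single cell count (length 1)
cell_lengths <- apply(UCBAdmissions, c(1, 2, 3), length)
range(cell_lengths)

# A single number is not a contingency table, so fisher.test() stops with an error
msg <- tryCatch(
  fisher.test(UCBAdmissions[1, 1, 1]),
  error = function(e) conditionMessage(e)
)
msg
```

In other words, MARGIN names the dimensions to loop over, and whatever is left over is what the function receives.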
If you want to apply fisher.test to the first three of the six tables, namely "A", "B", and "C", you need to subset the UCBAdmissions data first, and then set MARGIN to 3 (the Dept dimension).
apply(UCBAdmissions[,,1:3], 3, fisher.test)
# $A
#
# Fisher's Exact Test for Count Data
#
# data: array(newX[, i], d.call, dn.call)
# p-value = 1.669e-05
# alternative hypothesis: true odds ratio is not equal to 1
# 95 percent confidence interval:
# 0.1970420 0.5920417
# sample estimates:
# odds ratio
# 0.3495628
#
#
# $B
#
# Fisher's Exact Test for Count Data
#
# data: array(newX[, i], d.call, dn.call)
# p-value = 0.6771
# alternative hypothesis: true odds ratio is not equal to 1
# 95 percent confidence interval:
# 0.2944986 2.0040231
# sample estimates:
# odds ratio
# 0.8028124
#
#
# $C
#
# Fisher's Exact Test for Count Data
#
# data: array(newX[, i], d.call, dn.call)
# p-value = 0.3866
# alternative hypothesis: true odds ratio is not equal to 1
# 95 percent confidence interval:
# 0.8452173 1.5162918
# sample estimates:
# odds ratio
# 1.1329
Another option is to replace 3 with the dimension name:
apply(UCBAdmissions[,,1:3], "Dept", fisher.test)
This will give exactly the same result as the previous code.
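A quick way to check that claim, and to pull out just the p-values, is sketched below. The object names are illustrative only.

```r
data(UCBAdmissions)

# Same tests, once selecting the margin by position and once by name
by_index <- apply(UCBAdmissions[, , 1:3], 3, fisher.test)
by_name  <- apply(UCBAdmissions[, , 1:3], "Dept", fisher.test)

# Extract the p-value from each htest object
p_index <- sapply(by_index, function(t) t$p.value)
p_name  <- sapply(by_name,  function(t) t$p.value)

all.equal(p_index, p_name)
p_index
```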
In another case, if you want to apply fisher.test to contingency tables between Admit and Dept for "A", "B", "C", grouped by Gender, you can use:
apply(UCBAdmissions[,,1:3], "Gender", fisher.test)
# $Male
#
# Fisher's Exact Test for Count Data
#
# data: array(newX[, i], d.call, dn.call)
# p-value = 7.217e-16
# alternative hypothesis: two.sided
#
#
# $Female
#
# Fisher's Exact Test for Count Data
#
# data: array(newX[, i], d.call, dn.call)
# p-value < 2.2e-16
# alternative hypothesis: two.sided
To show the part being tested more clearly, I reshape the data and filter it so that only male students in depts A, B, and C remain. Then I apply fisher.test to the result.
library(dplyr)
library(tidyr)
DF <- UCBAdmissions %>%
as.data.frame %>%
filter(Gender == "Male",
Dept %in% c("A", "B", "C")) %>%
select(-Gender) %>%
pivot_wider(names_from = Admit, values_from = Freq)
DF
# # A tibble: 3 x 3
# Dept Admitted Rejected
# <fct> <dbl> <dbl>
# 1 A 512 313
# 2 B 353 207
# 3 C 120 205
fisher.test(DF[1:3, 2:3])
#
# Fisher's Exact Test for Count Data
#
# data: DF[1:3, 2:3]
# p-value = 7.217e-16
# alternative hypothesis: two.sided
The result is exactly the same as the one returned by apply(UCBAdmissions[,,1:3], "Gender", fisher.test) for the Male group.
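The same comparison can also be made without reshaping, by slicing the array directly. This is a minimal base R sketch; the names male, p_slice and p_apply are assumptions for illustration.

```r
data(UCBAdmissions)

# The Admit x Dept table for male applicants in departments A, B and C
male <- UCBAdmissions[, "Male", c("A", "B", "C")]
male

# Test that slice directly, and compare with the Male element from apply()
p_slice <- fisher.test(male)$p.value
p_apply <- apply(UCBAdmissions[, , 1:3], "Gender", fisher.test)$Male$p.value

all.equal(p_slice, p_apply)
```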
Personally, I do it this way:
First, make a list UCB_list.
Then bind the list elements into a data frame with rbindlist from data.table.
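Finally, use lapply, passing y = df$Gender as the column to test every other column against. Note that fisher.test(x, y) cross-tabulates the two columns row by row, so the counts in Freq are not used as cell frequencies here: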
library(data.table)
UCB_list <- list(UCBAdmissions)
df <- rbindlist(lapply(UCB_list, data.frame))
lapply(df, fisher.test, y = df$Gender)
$Admit
Fisher's Exact Test for Count Data
data: X[[i]] and df$Gender
p-value = 1
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
0.1537975 6.5020580
sample estimates:
odds ratio
1
$Gender
Fisher's Exact Test for Count Data
data: X[[i]] and df$Gender
p-value = 7.396e-07
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
16.56459 Inf
sample estimates:
odds ratio
Inf
$Dept
Fisher's Exact Test for Count Data
data: X[[i]] and df$Gender
p-value = 1
alternative hypothesis: two.sided
$Freq
Fisher's Exact Test for Count Data
data: X[[i]] and df$Gender
p-value = 0.4783
alternative hypothesis: two.sided

Having trouble running friedman test on ordinal data in R

I am attempting to run a Friedman test on ordinal data in R and am getting errors. The data can be found on Dropbox: https://www.dropbox.com/s/gh8crh18y1ueriy/seltoutput.xlsx?dl=0
As a description of the data:
group1: group assignments, 2 levels
time1: time points, 2 levels
loameasure: ordinal data, 5 levels
distmeasure: continuous data
vectemp: participant IDs
After importing the data I run the following to correctly format:
selt$loameasure<-factor(selt$loameasure)
selt$distmeasure<-as.numeric(selt$distmeasure)
selt$time1<-factor(selt$time1)
Then I run:
friedman_test(formula = loameasure ~ time1 | vectemp, data = selt)
Then I get the error:
Error in friedman.test.default(c(3L, 2L, 3L, 2L, 2L, 5L, 2L, 1L, 3L, 4L, :
not an unreplicated complete block design
I thought that loameasure and time1 had to be factors but I did try them as numeric and I get a similar error:
Error in friedman.test.default(c(3, 2, 3, 2, 2, 5, 2, 1, 3, 4, 2, 2, 4, :
not an unreplicated complete block design
I've been playing around with this for days and haven't been able to figure out what my problem is. I would love some assistance! Thank you in advance!
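For reference, stats::friedman.test raises this error when the cross-tabulation of treatment by block contains any count other than 1, that is, when some participant is missing a time point or has more than one row for it. A sketch of how that can be checked is below; since the real data are not reproduced here, it uses a small made-up data frame (toy, id, time and score are all hypothetical names).

```r
# Hypothetical toy data: participant 3 has two rows at time 2
toy <- data.frame(
  id    = c(1, 1, 2, 2, 3, 3, 3),
  time  = factor(c(1, 2, 1, 2, 1, 2, 2)),
  score = c(3, 2, 4, 4, 1, 2, 3)
)

# Every participant x time cell must contain exactly one observation
counts <- table(toy$id, toy$time)
counts

# Participants that break the "unreplicated complete block" requirement
bad_ids <- rownames(counts)[apply(counts != 1, 1, any)]
bad_ids

# Running the test on such data stops with an error, captured here as text
msg <- tryCatch(
  friedman.test(score ~ time | id, data = toy),
  error = function(e) conditionMessage(e)
)
msg
```

On the real data, the equivalent check would be table(selt$vectemp, selt$time1), looking for zeros or values above 1.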
As far as I can tell, a Friedman test is not appropriate in your situation. I would suggest performing a two-way ANOVA for unbalanced designs with Type-III sums of squares.
The assumptions of normality of residuals and homogeneity of variances are met (see the checks at the end).
I have tried to show how to perform the test and what some of the steps mean. It is not complete (interpretation is missing, for example), but it should give you a starting point and a direction.
We want to know if loameasure depends on group1 and time1
We will perform a two-way anova with two factors
Dependent variable: loameasure
Independent variable: group1 and time1
library(readxl)
# load your data
df <- read_excel("C:/Users/coding/Downloads/seltoutput.xlsx",
col_types = c("numeric", "numeric", "numeric"))
# Prepare data
# group1 to factor
df$group1 <- factor(df$group1,
levels = c(0, 1),
labels = c("Group_0", "Group_1"))
# time1 to factor
df$time1 <- factor(df$time1,
levels = c(1, 2),
labels = c("Time_1", "Time_2"))
# ----------------------------------------------------------------------------
# Visualize
library("ggpubr")
ggboxplot(df, x = "time1", y = "loameasure", color = "group1",
palette = c("#00AFBB", "#E7B800"))
ggline(df, x = "time1", y = "loameasure", color = "group1",
add = c("mean_se", "dotplot"),
palette = c("#00AFBB", "#E7B800"))
# -----------------------------------------------------------------------------
# first decide if the design is balanced or unbalanced
table(df$group1, df$time1)
# Output
# Time_1 Time_2
# Group_0 20 20
# Group_1 29 29
# Here it is an unbalanced design
# An unbalanced design has unequal numbers of subjects in each group
## We will perform a two-way ANOVA in R for unbalanced designs
# The recommended method is Type-III sums of squares.
# You need the `car` package
library(car)
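# Our two-way ANOVA for an unbalanced design (Type-III SS)
# Note: Type-III tests depend on the contrast coding; with R's default
# treatment contrasts the group1 and time1 rows test effects at the
# reference level of the other factor, not averaged main effects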
df_anova <- aov(loameasure ~ group1 * time1, data = df)
Anova(df_anova, type = "III")
## Output
# Anova Table (Type III tests)
# Response: loameasure
# Sum Sq Df F value Pr(>F)
# (Intercept) 120.050 1 83.9312 1.116e-14 ***
# group1 0.700 1 0.4891 0.48607
# time1 62.500 1 43.6960 2.301e-09 ***
# group1:time1 5.716 1 3.9963 0.04849 *
# Residuals 134.452 94
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# Normality check ---------------------------------------------------------
# get the residuals (difference between each individual and their group1/time1 combination mean)
res <- df_anova$residuals
# Histogram of residuals: Residuals should be normally distributed
hist(res,main="Histogram of residuals", xlab = "Residuals")
# Run Shapiro-Wilk test on the residuals
shapiro.test(x = res )
# Output
# data: res
# W = 0.97708, p-value = 0.08434
# P is > 0.05 therefore normality can be assumed.
# Homogeneity test --------------------------------------------------------
# Levene's test for equality of variances (in `car` package, loaded above)
leveneTest(loameasure ~ time1 * group1, data = df)
# Output:
# Levene's Test for Homogeneity of Variance (center = median)
# Df F value Pr(>F)
# group 3 0.3196 0.8112
# 94
# P is > 0.05 therefore equal variances can be assumed.

R: How to create result table of the results of multiple statistics tests?

I am a complete beginner in R.
I ran multiple Chi-square tests on a column of data in R with this code:
apply(mydata, 2, chisq.test, p=expected.probability)
and got multiple results like this:
$Primary Tumor
Chi-squared test for given probabilities
data: newX[, i]
X-squared = 515108, df = 6, p-value < 2.2e-16
$Primary Tumor_1
Chi-squared test for given probabilities
data: newX[, i]
X-squared = 583205, df = 6, p-value < 2.2e-16
$Primary Tumor_2
Chi-squared test for given probabilities
data: newX[, i]
X-squared = 58089, df = 6, p-value < 2.2e-16
Can I extract a results table with tumour number, X-squared result, df and p-value for the 50 samples I tested?
I can copy and paste into Excel, but I want to learn the code for larger samples.
Thank you:)
You can see the names of the components that chisq.test returns:
names(chisq.test(matrix(1:4,ncol=2)))
[1] "statistic" "parameter" "p.value" "method" "data.name" "observed"
[7] "expected" "residuals" "stdres"
The values you need are statistic (chisq), parameter (df), p.value.
So we simulate data:
mydata = matrix(rpois(100,50),ncol=10)
colnames(mydata) = paste0("tumor",1:10)
And write a more elaborate function to pull out these values after the test:
res = apply(mydata,2,function(x){
chisq.test(x,p=rep(0.1,10))[c("statistic","parameter","p.value")]
})
And we make it a data.frame:
df = data.frame(id=names(res),do.call(rbind,res))
df
id statistic parameter p.value
tumor1 tumor1 4.322896 9 0.8889048
tumor2 tumor2 5.285714 9 0.8087245
tumor3 tumor3 2.803063 9 0.9715936
tumor4 tumor4 8.62578 9 0.4725097
tumor5 tumor5 13.22846 9 0.1525381
tumor6 tumor6 8.653768 9 0.4698283
tumor7 tumor7 7.666667 9 0.5680554
tumor8 tumor8 5.919132 9 0.7479838
tumor9 tumor9 8.051335 9 0.5289813
tumor10 tumor10 13.46875 9 0.1425173
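A variant of the same idea returns a plain numeric vector from each test, so the final data frame has ordinary numeric columns rather than list columns. This is only a sketch: the seed and the names res_mat and out are assumptions added for illustration.

```r
# Simulated data in the same shape as above (seed chosen arbitrarily)
set.seed(42)
mydata <- matrix(rpois(100, 50), ncol = 10)
colnames(mydata) <- paste0("tumor", 1:10)

# Return a named numeric vector per column; apply() stacks them into a matrix
res_mat <- t(apply(mydata, 2, function(x) {
  ct <- chisq.test(x, p = rep(0.1, 10))
  c(statistic = unname(ct$statistic),
    df        = unname(ct$parameter),
    p.value   = ct$p.value)
}))

# One row per tumor, with the column name carried over as an id
out <- data.frame(id = rownames(res_mat), res_mat, row.names = NULL)
out
```

Because the columns are numeric, the table can be sorted by p.value or written out with write.csv() directly.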
Try this:
df <- apply(mydata, 2, chisq.test, p=expected.probability)
This just assigns the result to a variable that can be accessed from your environment. Check out this question too, it may help you as well: "chi square test for each row in data frame".

How to run cor.test with nested loops

I am trying to run a correlation test on several data frames at once, one for each number of unique stores an employee is assigned to, with a column representing the different regions. My data frame is split by the number of unique stores each employee has:
unique_store_breakdown <- split(Data, as.factor(Data$unique_stores))
Ideally I would like the output:
Region -- unique_store -- correlation
Midwest ------- 1 -------------- .05
Midwest ------- 2 -------------- .04
.
.
Southeast ----- 1 ------------- 0.75
.
.
cor_tests <-list()
counter = 0
for (i in unique(j$region)){
for (j in 1: length(unique_store_breakdown)){
counter = counter + 1
#Create new variables for correlation test
x = as.numeric(j[j$region == i,]$quality)
y = as.numeric(j[j$region == i,]$rsv)
cor_tests[[counter]] <- cor.test(x,y)
}}
cor_tests
I am able to run this for one data frame at a time, but when I try to add the nested loop (the j term) I receive the error "Error: $ operator is invalid for atomic vectors". Additionally, I would also like to output the results as a data frame rather than a list, if possible.
If all you want to do is perform cor.test() for each store, it should be fairly simple using by(). The output from by() is a regular list, it's just the printing that is a little special.
# example data
set.seed(1)
dtf <- data.frame(store=rep(1:3, each=30), rsv=rnorm(90))
dtf$quality <- dtf$rsv + rnorm(90, 0, dtf$store)
# perform cor.test for every store
by(dtf, dtf$store, function(x) cor.test(x$quality, x$rsv))
# dtf$store: 1
#
# Pearson's product-moment correlation
#
# data: x$quality and x$rsv
# t = 5.5485, df = 28, p-value = 6.208e-06
# alternative hypothesis: true correlation is not equal to 0
# 95 percent confidence interval:
# 0.4915547 0.8597796
# sample estimates:
# cor
# 0.7236681
#
# ------------------------------------------------------------------------------
# dtf$store: 2
#
# Pearson's product-moment correlation
#
# data: x$quality and x$rsv
# t = 0.68014, df = 28, p-value = 0.502
# alternative hypothesis: true correlation is not equal to 0
# 95 percent confidence interval:
# -0.2439893 0.4663368
# sample estimates:
# cor
# 0.1274862
#
# ------------------------------------------------------------------------------
# dtf$store: 3
#
# Pearson's product-moment correlation
#
# data: x$quality and x$rsv
# t = 2.2899, df = 28, p-value = 0.02977
# alternative hypothesis: true correlation is not equal to 0
# 95 percent confidence interval:
# 0.04304952 0.66261810
# sample estimates:
# cor
# 0.397159
#
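To get closer to the layout asked for in the question (one row per region and store count, returned as a data frame), the grouping can be done on both variables at once. This is a sketch under assumptions: the data are simulated, and the column names region, unique_stores, quality and rsv are taken from the question's code rather than from real data.

```r
# Simulated stand-in for the question's data (values are made up)
set.seed(1)
Data <- data.frame(
  region        = rep(c("Midwest", "Southeast"), each = 60),
  unique_stores = rep(rep(1:2, each = 30), times = 2),
  rsv           = rnorm(120)
)
Data$quality <- Data$rsv + rnorm(120)

# Split on both grouping variables; drop = TRUE skips empty combinations
groups <- split(Data, list(Data$region, Data$unique_stores), drop = TRUE)

# One cor.test per group, each reduced to a one-row data frame
rows <- lapply(groups, function(g) {
  ct <- cor.test(g$quality, g$rsv)
  data.frame(
    region       = g$region[1],
    unique_store = g$unique_stores[1],
    correlation  = unname(ct$estimate),
    p.value      = ct$p.value
  )
})

result <- do.call(rbind, rows)
rownames(result) <- NULL
result
```

This avoids the nested loop entirely, which also sidesteps the reuse of j as both a loop index and a data frame.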

T test in a cross tabulated table

I have a table like so:
1 2 3 4 5
10 22 15 14 3
15 44 22 26 9
...more rows
I want to run a t test on a single row to find out if it's plausible that its mean is less than 3. Using t.test(table[x, ]) doesn't work, because it assumes I'm interested in the mean of the values in the row, which I'm not: the values just indicate the number of responses to each value on a scale of 1-5.
How do I do this?
You could use the following approach:
Ungroup your data
Apply the t.test to each row
apply(data, 1, function(data) {t.test( rep(1:5, times = data), alternative = "less", mu = 3)})
Which will return a t-test for each row, e.g.:
[[1]]
One Sample t-test
data: rep(1:5, times = data)
t = -2.4337, df = 63, p-value = 0.008896
alternative hypothesis: true mean is less than 3
95 percent confidence interval:
-Inf 2.892043
sample estimates:
mean of x
2.65625
[[2]]
One Sample t-test
data: rep(1:5, times = data)
t = -2.3745, df = 115, p-value = 0.009613
alternative hypothesis: true mean is less than 3
95 percent confidence interval:
-Inf 2.921981
sample estimates:
mean of x
2.741379
If you want just the p-values then add $p.value:
apply(data, 1, function(data) {t.test( rep(1:5, times = data), alternative = "less", mu = 3)$p.value})
[1] 0.008895887 0.009613075
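To make the "ungroup" step concrete, the sketch below expands the first row of counts into individual responses and then collects n, mean and p-value for both example rows. The object names tab and summary_rows are illustrative assumptions.

```r
# The two example rows: counts of responses 1..5
tab <- rbind(c(10, 22, 15, 14, 3),
             c(15, 44, 22, 26, 9))
colnames(tab) <- 1:5

# Ungrouping: repeat each scale value as many times as it was chosen
scores_row1 <- rep(1:5, times = tab[1, ])
length(scores_row1)   # 64 individual responses
mean(scores_row1)     # 2.65625, the mean reported by t.test for row 1

# Same expansion per row, keeping only a few summary numbers
summary_rows <- t(apply(tab, 1, function(counts) {
  scores <- rep(1:5, times = counts)
  tt <- t.test(scores, alternative = "less", mu = 3)
  c(n = length(scores), mean = mean(scores), p.value = tt$p.value)
}))
summary_rows
```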
