I am trying to run correlation tests simultaneously across several data frames, one for each number of unique stores an employee is assigned to, and across the different regions within each. My data frame is split by the number of unique stores each employee has:
unique_store_breakdown <- split(Data, as.factor(Data$unique_stores))
Ideally I would like the output:
Region -- unique_store -- correlation
Midwest ------- 1 -------------- .05
Midwest ------- 2 -------------- .04
.
.
Southeast ----- 1 ------------- 0.75
.
.
cor_tests <-list()
counter = 0
for (i in unique(j$region)){
for (j in 1: length(unique_store_breakdown)){
counter = counter + 1
#Create new variables for correlation test
x = as.numeric(j[j$region == i,]$quality)
y = as.numeric(j[j$region == i,]$rsv)
cor_tests[[counter]] <- cor.test(x,y)
}}
cor_tests
I am able to run this for one data frame at a time, but when I add the nested loop (the j term) I receive the error "Error: $ operator is invalid for atomic vectors". Additionally, I would like to output the results as a data frame rather than a list, if possible.
If all you want to do is perform cor.test() for each store, it should be fairly simple using by(). The output from by() is a regular list; only the printing is a little special.
# example data
set.seed(1)
dtf <- data.frame(store=rep(1:3, each=30), rsv=rnorm(90))
dtf$quality <- dtf$rsv + rnorm(90, 0, dtf$store)
# perform cor.test for every store
by(dtf, dtf$store, function(x) cor.test(x$quality, x$rsv))
# dtf$store: 1
#
# Pearson's product-moment correlation
#
# data: x$quality and x$rsv
# t = 5.5485, df = 28, p-value = 6.208e-06
# alternative hypothesis: true correlation is not equal to 0
# 95 percent confidence interval:
# 0.4915547 0.8597796
# sample estimates:
# cor
# 0.7236681
#
# ------------------------------------------------------------------------------
# dtf$store: 2
#
# Pearson's product-moment correlation
#
# data: x$quality and x$rsv
# t = 0.68014, df = 28, p-value = 0.502
# alternative hypothesis: true correlation is not equal to 0
# 95 percent confidence interval:
# -0.2439893 0.4663368
# sample estimates:
# cor
# 0.1274862
#
# ------------------------------------------------------------------------------
# dtf$store: 3
#
# Pearson's product-moment correlation
#
# data: x$quality and x$rsv
# t = 2.2899, df = 28, p-value = 0.02977
# alternative hypothesis: true correlation is not equal to 0
# 95 percent confidence interval:
# 0.04304952 0.66261810
# sample estimates:
# cor
# 0.397159
#
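The question also asks for one row per region and store count, returned as a data frame. A minimal sketch of one way to get that is below. It assumes columns named region, unique_stores, quality and rsv as in the question; the example data are made up. Note that in the original loop j is an integer index rather than a data frame, which is most likely what triggers the `$` error.

```r
# Hypothetical example data with two grouping columns
set.seed(1)
dtf <- data.frame(
  region        = rep(c("Midwest", "Southeast"), each = 60),
  unique_stores = rep(rep(1:2, each = 30), times = 2),
  rsv           = rnorm(120)
)
dtf$quality <- dtf$rsv + rnorm(120)

# Split on both grouping columns at once; drop = TRUE skips empty combinations
groups <- split(dtf, list(dtf$region, dtf$unique_stores), drop = TRUE)

# Run cor.test() per group and keep one summary row for each
res <- do.call(rbind, lapply(groups, function(d) {
  ct <- cor.test(d$quality, d$rsv)
  data.frame(region        = d$region[1],
             unique_stores = d$unique_stores[1],
             correlation   = unname(ct$estimate),
             p_value       = ct$p.value)
}))
rownames(res) <- NULL
res
```

Splitting on both columns replaces the nested loop, and because each group yields a one-row data frame, rbind stacks them into the requested table.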
I am trying to use apply() on an array of matrices.
Here is an example:
data(UCBAdmissions)
fisher.test(UCBAdmissions[,,1]) #This works great
apply(UCBAdmissions, c(1,2,3), fisher.test) #This fails
The UCBAdmissions data contains six 2 x 2 contingency tables along the Dept dimension, namely "A", "B", "C", "D", "E", and "F".
dimnames(UCBAdmissions)
#$Admit
#[1] "Admitted" "Rejected"
#$Gender
#[1] "Male" "Female"
#$Dept
#[1] "A" "B" "C" "D" "E" "F"
You can apply fisher.test to each of these six tables. It is not clear to me from your code apply(UCBAdmissions, c(1,2,3), fisher.test) which of the six tables you want to test. With MARGIN = c(1,2,3), apply passes each single cell to fisher.test, which is why the call fails.
If you want to apply fisher.test to the first three of the six tables, namely "A", "B", and "C", subset the UCBAdmissions data first, and then set MARGIN to 3 (the Dept dimension).
apply(UCBAdmissions[,,1:3], 3, fisher.test)
# $A
#
# Fisher's Exact Test for Count Data
#
# data: array(newX[, i], d.call, dn.call)
# p-value = 1.669e-05
# alternative hypothesis: true odds ratio is not equal to 1
# 95 percent confidence interval:
# 0.1970420 0.5920417
# sample estimates:
# odds ratio
# 0.3495628
#
#
# $B
#
# Fisher's Exact Test for Count Data
#
# data: array(newX[, i], d.call, dn.call)
# p-value = 0.6771
# alternative hypothesis: true odds ratio is not equal to 1
# 95 percent confidence interval:
# 0.2944986 2.0040231
# sample estimates:
# odds ratio
# 0.8028124
#
#
# $C
#
# Fisher's Exact Test for Count Data
#
# data: array(newX[, i], d.call, dn.call)
# p-value = 0.3866
# alternative hypothesis: true odds ratio is not equal to 1
# 95 percent confidence interval:
# 0.8452173 1.5162918
# sample estimates:
# odds ratio
# 1.1329
Another option is to replace 3 with the dimension name:
apply(UCBAdmissions[,,1:3], "Dept", fisher.test)
This will give exactly the same result as the previous code.
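If the printed output is too long to scan, the list returned by apply can be reduced to a small table. The sketch below runs the test on all six departments and pulls out the p-value and odds-ratio estimate; the names tests and summary_df are only illustrative.

```r
data(UCBAdmissions)

# One Fisher's exact test per department (each slice is a 2 x 2 table)
tests <- apply(UCBAdmissions, "Dept", fisher.test)

# Each list element is an "htest" object; extract the pieces of interest
summary_df <- data.frame(
  Dept       = names(tests),
  p_value    = sapply(tests, function(t) t$p.value),
  odds_ratio = sapply(tests, function(t) unname(t$estimate)),
  row.names  = NULL
)
summary_df
```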
In another case, if you want to apply fisher.test to contingency tables between Admit and Dept for "A", "B", "C", grouped by Gender, you can use:
apply(UCBAdmissions[,,1:3], "Gender", fisher.test)
# $Male
#
# Fisher's Exact Test for Count Data
#
# data: array(newX[, i], d.call, dn.call)
# p-value = 7.217e-16
# alternative hypothesis: two.sided
#
#
# $Female
#
# Fisher's Exact Test for Count Data
#
# data: array(newX[, i], d.call, dn.call)
# p-value < 2.2e-16
# alternative hypothesis: two.sided
To show more clearly which part is being tested, I reshape the data and filter it so that it contains only male students in departments A, B, and C. Then I apply fisher.test to the result:
library(dplyr)  # filter(), %>%
library(tidyr)  # pivot_wider()
DF <- UCBAdmissions %>%
as.data.frame %>%
filter(Gender == "Male", Dept %in% c("A", "B", "C")) %>%
pivot_wider(id_cols = Dept, names_from = Admit, values_from = Freq)
DF
# # A tibble: 3 x 3
# Dept Admitted Rejected
# <fct> <dbl> <dbl>
# 1 A 512 313
# 2 B 353 207
# 3 C 120 205
fisher.test(DF[1:3, 2:3])
#
# Fisher's Exact Test for Count Data
#
# data: DF[1:3, 2:3]
# p-value = 7.217e-16
# alternative hypothesis: two.sided
The result is exactly the same as the one from apply(UCBAdmissions[,,1:3], "Gender", fisher.test) for the Male group.
Personally, I would do it this way:
First, make a list UCB_list.
Then, bind the list elements into a data frame with rbindlist from data.table.
Finally, use lapply over the columns of df, passing y = df$Gender as the second argument to fisher.test:
library(data.table)
UCB_list <- list(UCBAdmissions)
df <- rbindlist(lapply(UCB_list, data.frame))
lapply(df, fisher.test, y = df$Gender)
$Admit
Fisher's Exact Test for Count Data
data: X[[i]] and df$Gender
p-value = 1
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
0.1537975 6.5020580
sample estimates:
odds ratio
1
$Gender
Fisher's Exact Test for Count Data
data: X[[i]] and df$Gender
p-value = 7.396e-07
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
16.56459 Inf
sample estimates:
odds ratio
Inf
$Dept
Fisher's Exact Test for Count Data
data: X[[i]] and df$Gender
p-value = 1
alternative hypothesis: two.sided
$Freq
Fisher's Exact Test for Count Data
data: X[[i]] and df$Gender
p-value = 0.4783
alternative hypothesis: two.sided
I have a (big) dataset which looks like this:
dat <- data.frame(m=c(rep("a",4),rep("b",3),rep("c",2)),
n1 =round(rnorm(mean = 20,sd = 10,n = 9)))
g <- rnorm(20,10,5)
dat
m n1
1 a 15.132
2 a 17.723
3 a 3.958
4 a 19.239
5 b 11.417
6 b 12.583
7 b 32.946
8 c 11.970
9 c 26.447
I want to perform a t-test on each category of "m" against the vector g, for example:
n1.a <- c(15.132,17.723,3.958,19.239)
I need to do a t-test like t.test(n1.a,g).
I initially thought about breaking them up into a list using split(dat,dat$m) and then using lapply, but it is not working.
Any thoughts on how to go about it?
Here's a tidyverse solution using map from purrr:
library(magrittr)  # %>%
library(purrr)     # map()
dat %>%
split(.$m) %>%
map(~ t.test(.x$n1, g))
Or, using lapply as you mentioned, which stores all of the t-test results in a list (or a shorter version using by, thanks @markus):
dat_split <- split(dat, dat$m)
res <- lapply(dat_split, function(x) t.test(x$n1, g))
Or
res <- by(dat, dat$m, function(x) t.test(x$n1, g))
Which gives us:
$a
Welch Two Sample t-test
data: .x$n1 and g
t = 1.5268, df = 3.0809, p-value = 0.2219
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-11.61161 33.64902
sample estimates:
mean of x mean of y
21.2500 10.2313
$b
Welch Two Sample t-test
data: .x$n1 and g
t = 1.8757, df = 2.2289, p-value = 0.1883
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-7.325666 20.863073
sample estimates:
mean of x mean of y
17.0000 10.2313
$c
Welch Two Sample t-test
data: .x$n1 and g
t = 10.565, df = 19, p-value = 2.155e-09
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
7.031598 10.505808
sample estimates:
mean of x mean of y
19.0000 10.2313
In base R you can do
lapply(split(dat, dat$m), function(x) t.test(x$n1, g))
Output
$a
Welch Two Sample t-test
data: x$n1 and g
t = 1.9586, df = 3.2603, p-value = 0.1377
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-6.033451 27.819258
sample estimates:
mean of x mean of y
21.0000 10.1071
$b
Welch Two Sample t-test
data: x$n1 and g
t = 2.3583, df = 2.3202, p-value = 0.1249
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-5.96768 25.75349
sample estimates:
mean of x mean of y
20.0000 10.1071
$c
Welch Two Sample t-test
data: x$n1 and g
t = 13.32, df = 15.64, p-value = 6.006e-10
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
13.77913 19.00667
sample estimates:
mean of x mean of y
26.5000 10.1071
Data
set.seed(1)
dat <- data.frame(m=c(rep("a",4),rep("b",3),rep("c",2)),
n1 =round(rnorm(mean = 20,sd = 10,n = 9)))
g <- rnorm(20,10,5)
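For a big dataset with many categories, the list of results may be easier to work with as a table. A minimal sketch, reusing the data above; the name t_summary is only illustrative:

```r
set.seed(1)
dat <- data.frame(m  = c(rep("a", 4), rep("b", 3), rep("c", 2)),
                  n1 = round(rnorm(mean = 20, sd = 10, n = 9)))
g <- rnorm(20, 10, 5)

# One Welch t-test per category of m, each against the same vector g
tests <- lapply(split(dat, dat$m), function(x) t.test(x$n1, g))

# Collect statistic, degrees of freedom and p-value into one data frame
t_summary <- data.frame(
  m         = names(tests),
  t         = sapply(tests, function(tt) unname(tt$statistic)),
  df        = sapply(tests, function(tt) unname(tt$parameter)),
  p_value   = sapply(tests, function(tt) tt$p.value),
  row.names = NULL
)
t_summary
```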
I have a data frame with count numbers and I want to perform a chisq.test for each value of the variable Cluster. So basically, I need 4 contingency tables (for "A","B","C","D") where rows = Category, columns = Drug, value = Total. Subsequently, a chisq.test should be run for all 4 tables.
Example data frame
df <- data.frame(Cluster = c(rep("A",8),rep("B",8),rep("C",8),rep("D",8)),
Category = rep(c(rep("0-1",2),rep("2-4",2),rep("5-12",2),rep(">12",2)),2),
Drug = rep(c("drug X","drug Y"),16),
Total = as.numeric(sample(20:200,32,replace=TRUE)))
Firstly, use xtabs() to produce stratified contingency tables.
tab <- xtabs(Total ~ Category + Drug + Cluster, df)
tab
# , , Cluster = A
#
# Drug
# Category drug X drug Y
# >12 92 75
# 0-1 33 146
# 2-4 193 95
# 5-12 76 195
#
# etc.
Then use apply() to conduct a Pearson's Chi-squared test over each stratum.
apply(tab, 3, chisq.test)
# $A
#
# Pearson's Chi-squared test
#
# data: array(newX[, i], d.call, dn.call)
# X-squared = 145.98, df = 3, p-value < 2.2e-16
#
# etc.
Furthermore, you can perform a Cochran-Mantel-Haenszel chi-squared test for conditional independence.
mantelhaen.test(tab)
# Cochran-Mantel-Haenszel test
#
# data: tab
# Cochran-Mantel-Haenszel M^2 = 59.587, df = 3, p-value = 7.204e-13
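Since this runs one test per cluster, you may also want to adjust the four p-values for multiple testing. The sketch below uses Holm's method via p.adjust; that choice, the seed, and the names p_raw and p_holm are assumptions made for illustration, not part of the question.

```r
set.seed(1)  # seed chosen only to make the example reproducible
df <- data.frame(Cluster  = rep(c("A", "B", "C", "D"), each = 8),
                 Category = rep(rep(c("0-1", "2-4", "5-12", ">12"), each = 2), 4),
                 Drug     = rep(c("drug X", "drug Y"), 16),
                 Total    = as.numeric(sample(20:200, 32, replace = TRUE)))

tab   <- xtabs(Total ~ Category + Drug + Cluster, df)
tests <- apply(tab, 3, chisq.test)

# Raw p-values per cluster, then Holm-adjusted p-values
p_raw  <- sapply(tests, function(t) t$p.value)
p_holm <- p.adjust(p_raw, method = "holm")

data.frame(Cluster = names(p_raw), p_raw, p_holm, row.names = NULL)
```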
We are using the var.test() function in R, for example:
T1<-rnorm(255,mean=1.432,sd=0.255)
T2<-rnorm(256,mean=1.485,sd=0.251)
var.test(T1,T2)
# F test to compare two variances
#
# data: T1 and T2
# F = 1.1027, num df = 254, denom df = 255, p-value = 0.436
# alternative hypothesis: true ratio of variances is not equal to 1
# 95 percent confidence interval:
# 0.8620164 1.4106568
# sample estimates:
# ratio of variances
# 1.102695
However, when we rerun the test using the same data we get very different results e.g:
T1<-rnorm(255,mean=1.432,sd=0.255)
T2<-rnorm(256,mean=1.485,sd=0.251)
var.test(T1,T2)
# F test to compare two variances
#
# data: T1 and T2
# F = 0.79853, num df = 254, denom df = 255, p-value = 0.07334
# alternative hypothesis: true ratio of variances is not equal to 1
# 95 percent confidence interval:
# 0.6242396 1.0215441
# sample estimates:
# ratio of variances
# 0.7985297
Why does this happen? Are we doing something wrong?
We have multiple data sets to analyse & we need to understand what is happening.
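Each call to rnorm() draws fresh random numbers, so the two runs are not using the same data, and the test results differ. To make your analyses reproducible, you can use set.seed(), which sets the seed of R's random number generator.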
set.seed(42) # set seed
T1 <- rnorm(255, mean=1.432, sd=0.255)
T2 <- rnorm(256, mean=1.485, sd=0.251)
var.test(T1, T2)
# same seed - same result
set.seed(42)
T1 <- rnorm(255, mean=1.432, sd=0.255)
T2 <- rnorm(256, mean=1.485, sd=0.251)
var.test(T1, T2)
# different seed - different result
set.seed(123)
T1 <- rnorm(255, mean=1.432, sd=0.255)
T2 <- rnorm(256, mean=1.485, sd=0.251)
var.test(T1, T2)
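To check this programmatically rather than by eye, you could wrap the steps in a small helper and compare the results. The function name run_test is just an illustration:

```r
# Hypothetical helper: seed the generator, simulate both samples, run the F test
run_test <- function(seed) {
  set.seed(seed)
  T1 <- rnorm(255, mean = 1.432, sd = 0.255)
  T2 <- rnorm(256, mean = 1.485, sd = 0.251)
  var.test(T1, T2)
}

same_1 <- run_test(42)
same_2 <- run_test(42)
other  <- run_test(123)

identical(same_1$statistic, same_2$statistic)  # TRUE: same seed, same data
identical(same_1$statistic, other$statistic)   # FALSE: different seed, different data
```

For real (non-simulated) data sets no seed is needed: var.test itself is deterministic, so the same input always gives the same result.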
All,
I would like to perform the equivalent of TukeyHSD for a rank-based median-shift test such as Kruskal-Wallis.
X=matrix(c(1,1,1,1,2,2,2,4,4,4,4,4,1,3,6,9,4,6,8,10,1,2,1,3),ncol=2)
anova=aov(X[,2]~factor(X[,1]))
TukeyHSD(anova)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = X[, 2] ~ factor(X[, 1]))
##
## $`factor(X[, 1])`
## diff lwr upr p adj
## 2-1 1.25 -5.927068 8.427068 0.8794664
## 4-1 -1.35 -7.653691 4.953691 0.8246844
## 4-2 -2.60 -9.462589 4.262589 0.5617125
kruskal.test(X[,2]~factor(X[,1]))
##
## Kruskal-Wallis rank sum test
##
## data: X[, 2] by factor(X[, 1])
## Kruskal-Wallis chi-squared = 1.7325, df = 2, p-value = 0.4205
I would like now to analyze the contrasts. Please help. Thanks.
Rik
If you want to do multiple comparisons after a Kruskal-Wallis test, you need the kruskalmc function from the pgirmess package. Before you can implement this function, you will need to transform your matrix to a dataframe. In your example:
# convert matrix to dataframe
dfx <- as.data.frame(X)
# the Kruskal-Wallis test & output
kruskal.test(dfx$V2~factor(dfx$V1))
Kruskal-Wallis rank sum test
data: dfx$V2 by factor(dfx$V1)
Kruskal-Wallis chi-squared = 1.7325, df = 2, p-value = 0.4205
# the post-hoc tests & output
library(pgirmess)
kruskalmc(V2~factor(V1), data = dfx)
Multiple comparison test after Kruskal-Wallis
p.value: 0.05
Comparisons
obs.dif critical.dif difference
1-2 1.75 6.592506 FALSE
1-4 1.65 5.790265 FALSE
2-4 3.40 6.303642 FALSE
If you want the compact letter display, similar to the output of TukeyHSD, for the Kruskal-Wallis test, the agricolae package provides it through its kruskal function. Using your own data:
library(agricolae)
print( kruskal(X[, 2], factor(X[, 1]), group=TRUE, p.adj="bonferroni") )
#### ...
#### $groups
#### trt means M
#### 1 2 8.50 a
#### 2 1 6.75 a
#### 3 4 5.10 a
(In this example the groups are not considered different, which is the same result as in the other answer.)
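If you would rather not install a package, base R offers pairwise.wilcox.test, which runs pairwise Wilcoxon rank-sum tests with a p-value adjustment. This is a different procedure from kruskalmc or agricolae's kruskal, so treat the sketch below as an alternative rather than an equivalent. The Holm adjustment and exact = FALSE (used because the data contain ties) are choices made here for illustration.

```r
X <- matrix(c(1,1,1,1,2,2,2,4,4,4,4,4,1,3,6,9,4,6,8,10,1,2,1,3), ncol = 2)

# Pairwise rank-sum tests between groups, Holm-adjusted;
# exact = FALSE is passed on to wilcox.test() to use the normal approximation
pw <- pairwise.wilcox.test(X[, 2], factor(X[, 1]),
                           p.adjust.method = "holm", exact = FALSE)

# Lower-triangular matrix of adjusted p-values, one cell per pair of groups
pw$p.value
```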