The p-value I get from cor.test() is different from the one I calculate by hand. I can't figure out what in the world I'm missing. Any help would be greatly appreciated!
Pearson's product-moment correlation
data: GSSE_new$MusicPerceptionScores and GSSE_new$MusicAptitudeScores
t = 27.152, df = 148, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.8811990 0.9359591
sample estimates:
cor
0.9125834
#######
2*pt(q=MPMA_cortest$statistic, df=MPMA_cortest$parameter, lower.tail=FALSE)
[1] 2.360846e-59
Since you have not supplied a minimal reproducible example with actual data, I cannot confirm with your own data, but here is a procedure showing that the manual calculation equals the cor.test p-value:
MPMA_cortest <- cor.test(mtcars$hp, mtcars$mpg)
p_manual <- pt(
q = abs(MPMA_cortest$statistic),
df = MPMA_cortest$parameter,
lower.tail = FALSE) * 2
p_manual == MPMA_cortest$p.value
#> t
#> TRUE
Edit: Also note that the cor.test printout only says p-value < 2.2e-16. The two values may well be exactly equal (yours is smaller, thus meeting the inequality condition).
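If you want to compare against the exact stored value rather than the truncated printout, you can pull it straight out of the returned object (a quick sketch, reusing the MPMA_cortest object from above):
MPMA_cortest$p.value                        # full-precision stored p-value
format(MPMA_cortest$p.value, digits = 16)   # or force more printed digits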
I run the R code as follows:
library(oibiostat)
data("swim")
## independent two-sample pooled t test
t.test(swim$wet.suit.velocity, swim$swim.suit.velocity,
alternative = "two.sided", paired = FALSE, var.equal = TRUE)
#unequal variance two-sample t test
t.test(swim$wet.suit.velocity, swim$swim.suit.velocity,
alternative = "two.sided", paired = FALSE, var.equal = FALSE)
Both calls give essentially the same output:
Two Sample t-test
data: swim$wet.suit.velocity and swim$swim.suit.velocity
t = 1.3688, df = 22, p-value = 0.1849
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.03992124 0.19492124
sample estimates:
mean of x mean of y
1.506667 1.429167
and
Welch Two Sample t-test
data: swim$wet.suit.velocity and swim$swim.suit.velocity
t = 1.3688, df = 21.974, p-value = 0.1849
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.03992937 0.19492937
sample estimates:
mean of x mean of y
1.506667 1.429167
The pooled two-sample t.test should give a different result from the unpooled one, since the formulas are different.
But if I run the code as follows:
set.seed(5)
x1 = rnorm(15, 95, 20)
x2 = rnorm(50, 110, 5)
t.test(x1, x2) # Welch
t.test(x1, x2, var.eq=T) # pooled
The outputs from the two t.test calls are clearly different. So did I just hit a coincidence with the swim data set?
When I calculate by hand, I get the Welch two-sample result, so I am very confused about why the pooled t.test output seems wrong.
Edit
As I say in a comment, the oibiostat package is not on CRAN; it is on GitHub. If it is not installed yet, run
devtools::install_github("OI-Biostat/oi_biostat_data")
And there is no need to attach the package to access one of its data sets; the following will load it:
data(swim, package = "oibiostat")
You have equal sample sizes in the two groups, n1 = n2 = 12. That is important, because in that case the test statistics for the pooled (equal-variance) t-test and the Welch (unequal-variance) t-test are equal in value, as you can verify by consulting the formulas on Wikipedia; that explains why you get essentially identical results. (This has been discussed on Cross Validated, but I couldn't find where; see maybe https://stats.stackexchange.com/questions/563859/equal-variance-vs-unequal-variance-for-comparing-groups and search that site.)
There is one small difference: the degrees of freedom are not equal, but the difference is not large enough to change the p-values or confidence intervals at the printed precision.
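To make that concrete, here is a minimal sketch (using the same two swim columns) that computes both test statistics from their textbook formulas; with n1 = n2 they coincide:
x <- swim$wet.suit.velocity
y <- swim$swim.suit.velocity
n1 <- length(x); n2 <- length(y)
# pooled (equal-variance) t statistic
sp2 <- ((n1 - 1) * var(x) + (n2 - 1) * var(y)) / (n1 + n2 - 2)
t_pooled <- (mean(x) - mean(y)) / sqrt(sp2 * (1/n1 + 1/n2))
# Welch (unequal-variance) t statistic
t_welch <- (mean(x) - mean(y)) / sqrt(var(x)/n1 + var(y)/n2)
c(pooled = t_pooled, welch = t_welch)  # both 1.3688 here, since n1 == n2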
But, by the way, in your swim example, the samples are paired, so you should really do
with(swim, t.test(wet.suit.velocity, swim.suit.velocity,
alternative = "two.sided", paired = TRUE))
Paired t-test
data: wet.suit.velocity and swim.suit.velocity
t = 12.318, df = 11, p-value = 8.885e-08
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
0.06365244 0.09134756
sample estimates:
mean difference
0.0775
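Note that a paired t-test is equivalent to a one-sample t-test on the within-swimmer differences, which is an easy way to check it:
with(swim, t.test(wet.suit.velocity - swim.suit.velocity))
# gives the same t = 12.318, df = 11 and p-value as the paired call above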
Any help is greatly appreciated. I did get help with cleaning up the data, so that part runs with no errors, but the values I get are not the same as the expected ones and I'm not sure where it goes wrong. I've tried running the cor.test a few different ways and still don't come up with the right values. I can't upload the dataset because it isn't local, but it is nycflights13, found here: https://nycflights13.tidyverse.org/
Here's my initial code:
suppressPackageStartupMessages(library(nycflights13))
suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(lm.beta))
suppressPackageStartupMessages(library(data.table))
Q1 <- (conf.level = 99.45) #based on Q2 values .. confidence level
delay_thresh = quantile(flights$dep_delay, p=c(0.003, 0.997), na.rm=T)
dist_thresh = quantile(flights$distance,p=c(0.003, 0.997), na.rm=T)
nycflights13_DT <- as.data.table(flights)
nycflights13_clean <- nycflights13_DT[nycflights13_DT$dep_delay > delay_thresh[[1]] &
nycflights13_DT$dep_delay < delay_thresh[[2]] &
nycflights13_DT$distance>dist_thresh[[1]] &
nycflights13_DT$distance < dist_thresh[[2]]]
Q2 <- cor.test(nycflights13_clean$dep_delay, nycflights13_clean$distance)
model2 = (lm(nycflights13_clean$dep_delay ~ nycflights13_clean$distance))
Q3 <- summary(model2)
So then I run:
Q2 <- cor.test(nycflights13_clean$dep_delay, nycflights13_clean$distance)
model2 = (lm(nycflights13_clean$dep_delay ~ nycflights13_clean$distance))
This is what I'm supposed to get:
t = -14.451, df = 326677, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.02870131 -0.02184737
sample estimates:
cor
-0.02527463
This is what I get:
Pearson's product-moment correlation
data: nycflights13_clean$dep_delay and nycflights13_clean$distance
t = -13.647, df = 316421, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.02773612 -0.02077163
sample estimates:
cor
-0.02425417
This is what I'm supposed to be doing:
Run cor.test for the relationship between departure delay and distance. Q2, do not round.
I can do almost anything in Python and NumPy but R has me scratching my head.
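For reference, the df that cor.test reports is the number of complete pairs minus 2, so one check I can do is count how many rows survive my cleaning:
# df for Pearson's cor.test is (number of complete pairs) - 2
nrow(nycflights13_clean)
sum(complete.cases(nycflights13_clean$dep_delay, nycflights13_clean$distance))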
I am wondering if there is a way to change the output settings of the prop.test function in R so that it displays the confidence interval in percentage terms instead of as decimals. For example, I am trying to find the 95% confidence interval for the proportion of immigrants in the West with diabetes. Here is my code and the output:
For context: sum(Immigrant_West$DIABETES) = 8 and nrow(Immigrant_West) = 144.
prop.test(x=sum(Immigrant_West$DIABETES),n=nrow(Immigrant_West),conf.level = .95,correct=TRUE)
1-sample proportions test with continuity correction
data: sum(Immigrant_West$DIABETES) out of nrow(Immigrant_West), null probability 0.5
X-squared = 112, df = 1, p-value <2e-16
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
0.02606 0.11017
sample estimates:
p
0.05556
So is there a way to change the confidence interval output to show [2.606%, 11.017%] instead of as decimals? Thank you!
This will probably be simpler:
prp.out <- prop.test(x=8, n=144, conf.level=.95, correct=TRUE)
prp.out$conf.int <- prp.out$conf.int * 100
prp.out
#
# 1-sample proportions test with continuity correction
#
# data: 8 out of 144, null probability 0.5
# X-squared = 112.01, df = 1, p-value < 2.2e-16
# alternative hypothesis: true p is not equal to 0.5
# 95 percent confidence interval:
# 2.606172 11.016593
# sample estimates:
# p
# 0.05555556
Not easily. The print format is controlled by the print.htest() function, which is documented in ?print.htest: it doesn't seem to offer any options other than the number of digits and the prefix for the "method" component.
If you want, you can hack the function yourself: dump stats:::print.htest to a file, edit it, and then source() it.
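A rough sketch of that route (the file name here is just an example):
dump("print.htest", file = "my_print_htest.R", envir = asNamespace("stats"))
## ... edit my_print_htest.R, e.g. rescale x$conf.int just before it is printed ...
source("my_print_htest.R")  # the edited print.htest should now mask the one in stats when you print from the console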
As suggested by @CarlWitthoft:
p <- prop.test(x=8,n=144)
str(p) ## see what's there
p$estimate <- p$estimate*100
p$conf.int <- p$conf.int*100
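If all you actually need is the interval as a text string in percent, a small sprintf() sketch (reusing the same prop.test call) also works:
ci <- prop.test(x = 8, n = 144)$conf.int * 100
sprintf("[%.3f%%, %.3f%%]", ci[1], ci[2])
#> [1] "[2.606%, 11.017%]"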
How do I calculate a confidence interval for a chi-square test in R? Is there a function for it, like chisq.test()?
There is no confidence interval for a chi-square test (it only checks whether the two categorical variables are independent), but you can compute a confidence interval for the difference in proportions, like this.
Say you have some data where 30% of the first group report success, while 70% of a second group report success:
row1 <- c(70,30)
row2 <- c(30,70)
my.table <- rbind(row1,row2)
Now you have data in contingency table:
> my.table
[,1] [,2]
row1 70 30
row2 30 70
Which you can run chisq.test on; clearly those two proportions are significantly different, so the two categorical variables are not independent:
> chisq.test(my.table)
Pearson's Chi-squared test with Yates' continuity correction
data: my.table
X-squared = 30.42, df = 1, p-value = 3.479e-08
If you do prop.test you find that you are 95% confident the difference between the proportions is somewhere between 26.29% and 53.70%, which makes sense, because the actual difference between the two observed proportions is 70%-30%=40%:
> prop.test(x=c(70,30),n=c(100,100))
2-sample test for equality of proportions with continuity correction
data: c(70, 30) out of c(100, 100)
X-squared = 30.42, df = 1, p-value = 3.479e-08
alternative hypothesis: two.sided
95 percent confidence interval:
0.2629798 0.5370202
sample estimates:
prop 1 prop 2
0.7 0.3
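You can also pull those numbers straight out of the returned object, which confirms the 40% difference:
pt_out <- prop.test(x = c(70, 30), n = c(100, 100))
unname(pt_out$estimate[1] - pt_out$estimate[2])  # 0.4
pt_out$conf.int                                  # 0.2629798 0.5370202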
An addition to @mysteRious' nice answer: if you have a 2x2 contingency matrix, you can use fisher.test instead of prop.test to get a test and confidence interval for the odds ratio rather than for the difference in proportions. In Fisher's exact test the null hypothesis corresponds to an odds ratio (OR) of 1.
Using @mysteRious' sample data
ft <- fisher.test(my.table)
ft
#
# Fisher's Exact Test for Count Data
#
#data: my.table
#p-value = 2.31e-08
#alternative hypothesis: true odds ratio is not equal to 1
#95 percent confidence interval:
# 2.851947 10.440153
#sample estimates:
#odds ratio
# 5.392849
Confidence intervals for the OR are then given in ft$conf.int:
ft$conf.int
#[1] 2.851947 10.440153
#attr(,"conf.level")
#[1] 0.95
To confirm, we manually calculate the sample OR (fisher.test reports a conditional maximum-likelihood estimate, so the two values differ slightly):
OR <- Reduce("/", my.table[, 1] / my.table[, 2])
OR
#[1] 5.444444
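Equivalently, the sample odds ratio is just the cross-product ratio of the 2x2 table:
(my.table[1, 1] * my.table[2, 2]) / (my.table[1, 2] * my.table[2, 1])
#[1] 5.444444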
I have a simple question. I've seen this behaviour in R for both t-tests and correlations.
I do a simple paired t-test (in this case, two vectors of length 100). So the df of the paired t-test should be 99. However this is not what appears in the t-test result output.
dataforTtest.x <- rnorm(100,3,1)
dataforTtest.y <- rnorm(100,1,1)
t.test(dataforTtest.x, dataforTtest.y,paired=TRUE)
the output of this is:
Paired t-test
data: dataforTtest.x and dataforTtest.y
t = 10, df = 100, p-value <2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
1.6 2.1
sample estimates:
mean of the differences
1.8
BUT, if I actually look into the resulting object, the df are correct.
> t.test(dataforTtest.x, dataforTtest.y,paired=TRUE)[["parameter"]]
df
99
Am I missing something very stupid?
I'm running R version 3.3.0 (2016-05-03)
This can happen when the global option for the number of significant digits to print has been changed, e.g. with options(digits = 2).
Note the results of a t-test before changing this setting:
Paired t-test
data: dataforTtest.x and dataforTtest.y
t = 13.916, df = 99, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
1.700244 2.265718
sample estimates:
mean of the differences
1.982981
And after setting options(digits=2):
Paired t-test
data: dataforTtest.x and dataforTtest.y
t = 13.916, df = 100, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
1.700244 2.265718
sample estimates:
mean of the differences
2
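If that has happened in your session, you can check the option and restore the factory default (7):
getOption("digits")   # see the current setting
options(digits = 7)   # restore the default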
In R, it can be dangerous to change this global setting for exactly that reason: it silently changes how results are displayed and can mislead without the user's knowledge. Instead, we can either use the round() function directly on a number, or, for test results like these, use it together with the broom package.
round(2.949,2)
[1] 2.95
#and
require(broom)
glance(t.test(dataforTtest.x, dataforTtest.y,paired=TRUE))
estimate statistic p.value parameter conf.low conf.high method alternative
1.831433 11.31853 1.494257e-19 99 1.51037 2.152496 Paired t-test two.sided