Not getting the desired Pearson correlation results - r

Any help is greatly appreciated. I got help cleaning up the data, and the code now runs without errors, but it does not produce the expected values and I'm not sure where it goes wrong. I've tried running cor.test a few different ways and still don't get the right values. I can't upload the dataset because it isn't a local file, but it is nycflights13, found here: https://nycflights13.tidyverse.org/
Here's my initial code:
suppressPackageStartupMessages(library(nycflights13))
suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(lm.beta))
suppressPackageStartupMessages(library(data.table))
Q1 <- (conf.level = 99.45) #based on Q2 values .. confidence level
delay_thresh = quantile(flights$dep_delay, p=c(0.003, 0.997), na.rm=T)
dist_thresh = quantile(flights$distance,p=c(0.003, 0.997), na.rm=T)
nycflights13_DT <- as.data.table(flights)
nycflights13_clean <- nycflights13_DT[nycflights13_DT$dep_delay > delay_thresh[[1]] &
nycflights13_DT$dep_delay < delay_thresh[[2]] &
nycflights13_DT$distance>dist_thresh[[1]] &
nycflights13_DT$distance < dist_thresh[[2]]]
Q2 <- cor.test(nycflights13_clean$dep_delay, nycflights13_clean$distance)
model2 = (lm(nycflights13_clean$dep_delay ~ nycflights13_clean$distance))
Q3 <- summary(model2)
So then I run:
Q2 <- cor.test(nycflights13_clean$dep_delay, nycflights13_clean$distance)
model2 = (lm(nycflights13_clean$dep_delay ~ nycflights13_clean$distance))
This is what I'm supposed to get:
t = -14.451, df = 326677, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.02870131 -0.02184737
sample estimates:
cor
-0.02527463
This is what I get:
Pearson's product-moment correlation
data: nycflights13_clean$dep_delay and nycflights13_clean$distance
t = -13.647, df = 316421, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.02773612 -0.02077163
sample estimates:
cor
-0.02425417
This is what I'm supposed to be doing:
Run cor.test for the relationship between departure delay and distance. Q2, do not round.
I can do almost anything in Python and NumPy but R has me scratching my head.
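One thing the two outputs do show: for a Pearson cor.test the degrees of freedom are n - 2, so df = 326677 versus df = 316421 means the expected answer was computed on roughly 10,000 more rows than the filtered table contains. The cause is not visible from the post. The sketch below uses made-up data (toy is hypothetical, not nycflights13) only to illustrate how strict versus inclusive comparisons at the quantile thresholds change the row count, and with it df and the estimate.

```r
# Hypothetical stand-in for the flights data; values are simulated.
set.seed(1)
toy <- data.frame(
  dep_delay = round(rnorm(5000, mean = 10, sd = 30)),
  distance  = sample(seq(100, 2500, by = 100), 5000, replace = TRUE)
)

delay_thresh <- quantile(toy$dep_delay, probs = c(0.003, 0.997), na.rm = TRUE)
dist_thresh  <- quantile(toy$distance,  probs = c(0.003, 0.997), na.rm = TRUE)

# Strict comparisons drop every row that ties with a threshold value.
strict <- subset(toy,
  dep_delay > delay_thresh[[1]] & dep_delay < delay_thresh[[2]] &
  distance  > dist_thresh[[1]]  & distance  < dist_thresh[[2]])

# Inclusive comparisons keep those tied rows.
inclusive <- subset(toy,
  dep_delay >= delay_thresh[[1]] & dep_delay <= delay_thresh[[2]] &
  distance  >= dist_thresh[[1]]  & distance  <= dist_thresh[[2]])

res_strict    <- cor.test(strict$dep_delay, strict$distance)
res_inclusive <- cor.test(inclusive$dep_delay, inclusive$distance)

# df is n - 2, so it reveals how many rows each test actually used.
c(n_strict = nrow(strict), df_strict = unname(res_strict$parameter))
c(n_inclusive = nrow(inclusive), df_inclusive = unname(res_inclusive$parameter))
```

Comparing nrow(nycflights13_clean) with 326677 + 2 is a quick way to confirm whether the filtering step, rather than cor.test itself, accounts for the difference.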

Related

Why does survey::svyranktest() calculate a different p-value to coin::wilcox_test() and stats::wilcox.test()?

I intend to use survey::svyranktest to run a weighted Wilcoxon Rank-Sum Test, as I don't believe either coin::wilcox_test or stats::wilcox.test() allow for weighted distributions.
To ensure it is working as expected I compared outputs on a small non-weighted sample data frame that is defined below.
I expected each output to be different, e.g. the svyranktest() outputs the H statistic from the Kruskal Wallis Test, while wilcox.test outputs the W value.
However, I expected the p-values for each to match. They are similar, but the p-value from svyranktest() is different from the others, and I want to understand why before I proceed to use this function.
sample_df<- tibble(Gender=c("female","female","female","female","female","female","male","male","male","male","male"),
Reactiontime=c(34,36,41,43,44,37,45,33,35,39,42),
Rank=c(2,4,7,9,10,5,11,1,3,6,8))
I then ran each test. The code I used and the output obtained from each are shown below.
wilcox.test(Reactiontime ~ Gender , data=sample_df, exact=F, correct=F)
Wilcoxon rank sum test
data: Reactiontime by Gender
W = 16, p-value = 0.8551
alternative hypothesis: true location shift is not equal to 0
coin::wilcox_test(Reactiontime ~ as.factor(Gender), data=sample_df)
Asymptotic Wilcoxon-Mann-Whitney Test
data: Reactiontime by as.factor(Gender) (female, male)
Z = 0.18257, p-value = 0.8551
alternative hypothesis: true mu is not equal to 0
design_test <- survey::svydesign(ids = ~0, data = sample_df)
mw_test <- survey::svyranktest(formula = Reactiontime ~ Gender , design_test, test = "wilcoxon")
Design-based KruskalWallis test
data: Reactiontime ~ Gender
t = -0.17904, df = 9, p-value = 0.8619
alternative hypothesis: true difference in mean rank score is not equal to 0
sample estimates:
difference in mean rank score
-0.03333333
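A hedged observation from the printed output alone: wilcox.test (with exact=F) and coin::wilcox_test refer their statistic to a normal distribution, while the svyranktest output reports t with df = 9, which suggests a t reference distribution. Plugging the printed statistics into base R reproduces both p-values, so the gap appears to come from the reference distribution (and the slightly different statistic) rather than from an error. The numbers below are copied from the outputs above; nothing here calls the survey or coin packages.

```r
# Statistics copied from the printed outputs above.
z_coin    <- 0.18257    # Z from coin::wilcox_test
t_survey  <- -0.17904   # t from survey::svyranktest
df_survey <- 9          # df from survey::svyranktest

# Two-sided p-value under a standard normal reference distribution.
p_normal <- 2 * pnorm(-abs(z_coin))

# Two-sided p-value under a t reference distribution with 9 df.
p_t <- 2 * pt(-abs(t_survey), df = df_survey)

round(c(normal = p_normal, t = p_t), 4)
```

The two values should land close to the reported 0.8551 and 0.8619. With larger samples the t and normal reference distributions converge, so the difference would be expected to shrink.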

cor.test() p value different than by hand?

The p-value using cor.test() is different than by hand. I can't figure out what in the world I'm missing. Any help would be greatly appreciated!
Pearson's product-moment correlation
data: GSSE_new$MusicPerceptionScores and GSSE_new$MusicAptitudeScores
t = 27.152, df = 148, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.8811990 0.9359591
sample estimates:
cor
0.9125834
#######
2*pt(q=MPMA_cortest$statistic, df=MPMA_cortest$parameter, lower.tail=FALSE)
[1] 2.360846e-59
Since you have not supplied a Minimal Reproducible Example with actual data, I cannot confirm with your own data, but here is a procedure that shows the manual version is equal to the cor.test p value:
MPMA_cortest <- cor.test(mtcars$hp, mtcars$mpg)
p_manual <- pt(
q = abs(MPMA_cortest$statistic),
df = MPMA_cortest$parameter,
lower.tail = FALSE) * 2
p_manual == MPMA_cortest$p.value
#> t
#> TRUE
Edit: Also note that the cor.test printout only says p-value < 2.2e-16. The two values may well be exactly equal (yours is smaller, thus meeting the inequality condition).
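To see both points in one place, the following sketch (using mtcars, as in the answer above) compares the two p-values with all.equal(), which tolerates floating-point rounding, and shows that the printed "< 2.2e-16" comes from how very small p-values are formatted rather than from the stored value. format.pval() is a base R function whose default eps is .Machine$double.eps.

```r
res <- cor.test(mtcars$hp, mtcars$mpg)

# Manual two-sided p-value from the t statistic and its df.
p_manual <- 2 * pt(abs(res$statistic), df = res$parameter, lower.tail = FALSE)

# Compare with a tolerance; unname() drops the "t" name carried by the statistic.
same <- isTRUE(all.equal(unname(p_manual), unname(res$p.value)))
same

# The value computed by hand in the question, and how it is displayed.
tiny  <- 2.360846e-59
shown <- format.pval(tiny)
shown
tiny < .Machine$double.eps
```

The unrounded value is always available as res$p.value, so there is no need to rely on the printed summary.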

T.test on a small set of data from a large set

We compare the average (or mean) of one group against the set average (or mean). This set average can be any theoretical value (or it can be the population mean).
I am trying to compare the mean of a small group of 300 observations against the mean of all 1500 observations using a one-sided t.test. Is this approach correct? If not, is there an alternative?
head(data$BMI)
attach(data)
tester<-mean(BMI)
table(BMI)
set.seed(123)
sampler<-sample(min(BMI):max(BMI),300,replace = TRUE)
mean(sampler)
t.test(sampler,tester)
The last line of the code yields:
Error in t.test.default(sampler, tester) : not enough 'y' observations
For testing your sample in t.test, you can do:
d <- rnorm(1500,mean = 3, sd = 1)
s <- sample(d,300)
Then, test for the normality of d and s:
> shapiro.test(d)
Shapiro-Wilk normality test
data: d
W = 0.9993, p-value = 0.8734
> shapiro.test(s)
Shapiro-Wilk normality test
data: s
W = 0.99202, p-value = 0.1065
Here both p-values are greater than 0.05, so the tests give no evidence against normality for d or s. You can therefore go on to the t.test:
> t.test(d,s)
Welch Two Sample t-test
data: d and s
t = 0.32389, df = 444.25, p-value = 0.7462
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.09790144 0.13653776
sample estimates:
mean of x mean of y
2.969257 2.949939
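For the situation in the question (one group of 300 against a fixed reference mean), a one-sample test may be closer to what was intended. The error "not enough 'y' observations" appears because the single number tester was passed as a second sample; a reference value goes in the mu argument instead. A minimal sketch with simulated BMI values (the data and the choice of alternative = "greater" are assumptions for illustration):

```r
set.seed(123)

# Hypothetical stand-in for data$BMI.
BMI <- rnorm(1500, mean = 27, sd = 4)

# Reference value: the mean of all 1500 observations.
tester <- mean(BMI)

# Draw 300 observed values, rather than integers from min(BMI):max(BMI).
sampler <- sample(BMI, 300)

# One-sample, one-sided t-test of the subgroup mean against the reference.
res <- t.test(sampler, mu = tester, alternative = "greater")
res$statistic
res$parameter
res$p.value
```

Use alternative = "less" if the hypothesis is that the subgroup mean is below the reference value.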

How do I compute point-by-point t tests between two 50 data point vectors?

I have a data frame with 3 variables (ID, pre and post) and 50 instances, somewhat like this:
ID<- c("1","2","3","4","5","6","7","8","9","10")
pre<- c("2.56802","2.6686","1.0145","0.2568","2.369","1.2365","0.6989","0.98745","1.09878","2.454658")
post<-c("3.3323","2.66989","1.565656","2.58989","5.96987","3.12145","1.23565","2.74741","2.54101","0.23568")
dfw1<-data.frame(ID,pre,post)
Each value in the pre and post columns is the mean of another population. I want to run a two-tailed t-test between the first elements of pre and post (pre against post), and loop this over all 50 rows. I have tried writing a loop, as shown below:
t<-0
for (i in 1:nrow(dfw$ID)) {
t[i]<-t.test(dfw$pre,dfw$post,alternative = c("two.sided"), conf.level = 0.95)
print(t)
}
It returned an error.
I want to extract statistics such as df, p-value and t-value for each row. How do I write this code in R?
This code shows that you cannot reject the null hypothesis of 0 difference at the conventional 5% significance level:
ID<- c("1","2","3","4","5","6","7","8","9","10")
pre<- as.numeric(c("2.56802","2.6686","1.0145","0.2568","2.369","1.2365","0.6989","0.98745","1.09878","2.454658"))
post<-as.numeric(c("3.3323","2.66989","1.565656","2.58989","5.96987","3.12145","1.23565","2.74741","2.54101","0.23568"))
dfw1<-data.frame(ID,pre,post)
t.test(dfw1$pre,dfw1$post,alternative = c("two.sided"), conf.level = 0.95, paired=TRUE)
Output (giving you the df, t-stat and p-value):
Paired t-test
data: dfw1$pre and dfw1$post
t = -2.1608, df = 9, p-value = 0.05899
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-2.18109315 0.04997355
sample estimates:
mean of the differences
-1.06556
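The question also asked how to pull out df, the t-value and the p-value. t.test() returns a list of class "htest", so these can be read off by name. Note that a separate t-test for each row is not possible here, since a single pre/post pair leaves no variance to estimate. A short sketch using the same data (the name out is only for illustration):

```r
pre  <- c(2.56802, 2.6686, 1.0145, 0.2568, 2.369,
          1.2365, 0.6989, 0.98745, 1.09878, 2.454658)
post <- c(3.3323, 2.66989, 1.565656, 2.58989, 5.96987,
          3.12145, 1.23565, 2.74741, 2.54101, 0.23568)

res <- t.test(pre, post, alternative = "two.sided",
              conf.level = 0.95, paired = TRUE)

# Collect the pieces of the htest object into one row.
out <- data.frame(
  t        = unname(res$statistic),
  df       = unname(res$parameter),
  p_value  = res$p.value,
  ci_lower = res$conf.int[1],
  ci_upper = res$conf.int[2]
)
out
```

names(res) lists every component that can be extracted this way.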

Pairwise t test using plyr

I would like to use the R package plyr to run a pairwise t test on a really large data frame, but I'm not sure how to do it. I recently learned how to do correlations using plyr, and I really like how you can specify which groups you want to compare and then plyr breaks down the data for you. For example, you could have plyr calculate the correlation between sepal length and sepal width for each species of iris in the iris dataset like this:
Correlations <- ddply(iris, "Species", function(x) cor(x$Sepal.Length, x$Sepal.Width))
I could break the data frame down myself by specifying that the data for the setosa species of iris are in rows 1:50 and so on, but plyr would be less likely than me to mess up and accidentally say rows 1:51, for example.
So how do I do something similar with a paired t test? How can I specify which observations are the pairs? Here's some example data that are similar to what I'm working with, and I'd like the pairs to be the Subject and I'd like to break the data down by Pesticide:
Exposure <- data.frame("Subject" = rep(1:4, 6),
"Season" = rep(c(rep("summer", 4), rep("winter", 4)),3),
"Pesticide" = rep(c("atrazine", "metolachlor", "chlorpyrifos"), each=8),
"Exposure" = sample(1:100, size=24))
Exposure$Subject <- as.factor(Exposure$Subject)
In other words, the question I'd like to evaluate is whether there is a difference in pesticide exposure for each person during the winter versus during the summer, and I'd like to answer that question separately for each of the three pesticides.
Much thanks in advance!
An edit: To clarify, this is how to do an unpaired t test in plyr:
TTests <- dlply(Exposure, "Pesticide", function(x) t.test(x$Exposure ~ x$Season))
And if I add "paired=T" in there, plyr will do a paired t test, but it assumes that I always have the pairs in the same order. While I do have them all in the same order in the example data frame above, I don't in my real data because I sometimes have missing data.
Do you want this?
library(data.table)
# convert to data.table in place
setDT(Exposure)
# make sure data is sorted correctly
setkey(Exposure, Pesticide, Season, Subject)
Exposure[, list(res = list(t.test(Exposure[Season == "summer"],
Exposure[Season == "winter"],
paired = T)))
, by = Pesticide]$res
#[[1]]
#
# Paired t-test
#
#data: Exposure[Season == "summer"] and Exposure[Season == "winter"]
#t = -4.1295, df = 3, p-value = 0.02576
#alternative hypothesis: true difference in means is not equal to 0
#95 percent confidence interval:
# -31.871962 -4.128038
#sample estimates:
#mean of the differences
# -18
#
#
#[[2]]
#
# Paired t-test
#
#data: Exposure[Season == "summer"] and Exposure[Season == "winter"]
#t = -6.458, df = 3, p-value = 0.007532
#alternative hypothesis: true difference in means is not equal to 0
#95 percent confidence interval:
# -73.89299 -25.10701
#sample estimates:
#mean of the differences
# -49.5
#
#
#[[3]]
#
# Paired t-test
#
#data: Exposure[Season == "summer"] and Exposure[Season == "winter"]
#t = -2.5162, df = 3, p-value = 0.08646
#alternative hypothesis: true difference in means is not equal to 0
#95 percent confidence interval:
# -30.008282 3.508282
#sample estimates:
#mean of the differences
# -13.25
I don't know ddply, but here's how I would do it using some base functions.
by(data = Exposure, INDICES = Exposure$Pesticide, FUN = function(x) {
t.test(Exposure ~ Season, data = x)
})
Exposure$Pesticide: atrazine
Welch Two Sample t-test
data: Exposure by Season
t = -0.1468, df = 5.494, p-value = 0.8885
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-49.63477 44.13477
sample estimates:
mean in group summer mean in group winter
60.50 63.25
----------------------------------------------------------------------------------------------
Exposure$Pesticide: chlorpyrifos
Welch Two Sample t-test
data: Exposure by Season
t = -0.8932, df = 4.704, p-value = 0.4151
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-83.58274 41.08274
sample estimates:
mean in group summer mean in group winter
52.25 73.50
----------------------------------------------------------------------------------------------
Exposure$Pesticide: metolachlor
Welch Two Sample t-test
data: Exposure by Season
t = 0.8602, df = 5.561, p-value = 0.4252
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-39.8993 81.8993
sample estimates:
mean in group summer mean in group winter
62.5 41.5
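Neither answer addresses the missing-data concern directly: the by() call above runs an unpaired Welch test, and the data.table version relies on every subject appearing in both seasons. One possible approach, sketched here in base R, is to merge the summer and winter rows on Subject within each pesticide, so that only complete pairs are compared and row order no longer matters. The helper name paired_by_subject and the deliberately dropped row are assumptions for illustration.

```r
set.seed(42)
Exposure <- data.frame(
  Subject   = factor(rep(1:4, 6)),
  Season    = rep(c(rep("summer", 4), rep("winter", 4)), 3),
  Pesticide = rep(c("atrazine", "metolachlor", "chlorpyrifos"), each = 8),
  Exposure  = sample(1:100, size = 24)
)

# Shuffle the rows and drop one to mimic unordered data with a missing value.
Exposure <- Exposure[sample(nrow(Exposure)), ]
Exposure <- Exposure[-1, ]

# Hypothetical helper: align summer and winter by Subject, then test.
paired_by_subject <- function(x) {
  summer <- x[x$Season == "summer", c("Subject", "Exposure")]
  winter <- x[x$Season == "winter", c("Subject", "Exposure")]
  # merge() keeps only subjects present in both seasons.
  both <- merge(summer, winter, by = "Subject",
                suffixes = c(".summer", ".winter"))
  t.test(both$Exposure.summer, both$Exposure.winter, paired = TRUE)
}

results <- lapply(split(Exposure, Exposure$Pesticide), paired_by_subject)

# One p-value and one df per pesticide; df = number of complete pairs - 1.
sapply(results, function(r) r$p.value)
sapply(results, function(r) unname(r$parameter))
```

The same helper can be passed to plyr, for example dlply(Exposure, "Pesticide", paired_by_subject), since it takes one pesticide's rows as a data frame.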
