T.test on a small set of data from a large set - r

We compare the mean of one group against a set average, which can be any theoretical value (or the population mean).
I am trying to compare the mean of a small group of 300 observations against 1500 observations using a one-sided t.test. Is this approach correct? If not, is there an alternative?
head(data$BMI)
attach(data)
tester<-mean(BMI)
table(BMI)
set.seed(123)
sampler<-sample(min(BMI):max(BMI),300,replace = TRUE)
mean(sampler)
t.test(sampler,tester)
The last line of the code yields:
Error in t.test.default(sampler, tester) : not enough 'y' observations
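The error occurs because tester is a single number, so t.test treats it as a second, too-short sample. A minimal fix, assuming the goal is to test the subsample against the full-data mean as a fixed reference: pass that value as mu so t.test runs a one-sample test. (Note also that sample(min(BMI):max(BMI), 300, replace = TRUE) draws from the integer range of BMI rather than from the observed values; sample(BMI, 300) would draw from the data itself.)
# one-sample t-test against the reference mean;
# set alternative = "less" or "greater" for a one-sided test
t.test(sampler, mu = tester)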

To test your sample with t.test, you can do:
d <- rnorm(1500,mean = 3, sd = 1)
s <- sample(d,300)
Then, test for the normality of d and s:
> shapiro.test(d)
Shapiro-Wilk normality test
data: d
W = 0.9993, p-value = 0.8734
> shapiro.test(s)
Shapiro-Wilk normality test
data: s
W = 0.99202, p-value = 0.1065
Here both p-values are above 0.05, so you can treat d and s as approximately normally distributed. You can then run the t-test:
> t.test(d,s)
Welch Two Sample t-test
data: d and s
t = 0.32389, df = 444.25, p-value = 0.7462
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.09790144 0.13653776
sample estimates:
mean of x mean of y
2.969257 2.949939
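Since the original question asked for a one-sided test, note that the same call accepts an alternative argument; a sketch (the direction shown is just for illustration):
# one-sided Welch test: is the mean of d greater than the mean of s?
t.test(d, s, alternative = "greater")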

Related

R - Two-Sample t-test Unchanged When Switching Between Pooled Variance and Non-Pooled Variance

I ran the following R code:
library(oibiostat)
data("swim")
## independent two-sample pooled t test
t.test(swim$wet.suit.velocity, swim$swim.suit.velocity,
       alternative = "two.sided", paired = FALSE, var.equal = TRUE)
## unequal-variance two-sample t test
t.test(swim$wet.suit.velocity, swim$swim.suit.velocity,
       alternative = "two.sided", paired = FALSE, var.equal = FALSE)
Both calls give essentially the same output:
Two Sample t-test
data: swim$wet.suit.velocity and swim$swim.suit.velocity
t = 1.3688, df = 22, p-value = 0.1849
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.03992124 0.19492124
sample estimates:
mean of x mean of y
1.506667 1.429167
and
Welch Two Sample t-test
data: swim$wet.suit.velocity and swim$swim.suit.velocity
t = 1.3688, df = 21.974, p-value = 0.1849
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.03992937 0.19492937
sample estimates:
mean of x mean of y
1.506667 1.429167
The pooled two-sample t-test uses a different formula from the unpooled one, so I expected the results to differ.
But if I run the code as follows:
set.seed(5)
x1 = rnorm(15, 95, 20)
x2 = rnorm(50, 110, 5)
t.test(x1, x2) # Welch
t.test(x1, x2, var.eq=T) # pooled
The outputs of the two t-tests are clearly different. So did I just get a coincidental data set?
I calculated by hand and found that the Welch two-sample t-test output is right. I am very confused about why the pooled t-test output would be wrong.
Edit
As I said in a comment, the package oibiostat is not on CRAN; it's on GitHub. If it is not installed yet, run
devtools::install_github("OI-Biostat/oi_biostat_data")
And there's no need to load a package to access one of its data sets; the following will load it:
data(swim, package = "oibiostat")
You have equal sample sizes in the two groups, n1 = n2 = 12. That is important, because in that case the test statistics of the pooled (equal-variance) t-test and the Welch (unequal-variance) t-test are equal in value, as you can verify from the formulas on Wikipedia: with n1 = n2 = n, both standard errors reduce to sqrt((s1^2 + s2^2)/n). That explains why you get identical results. (This has been discussed on Cross Validated, though I couldn't find where; see maybe https://stats.stackexchange.com/questions/563859/equal-variance-vs-unequal-variance-for-comparing-groups and search that site.)
There is one small difference: the degrees of freedom are not equal. But the difference is not large enough to change the p-values or confidence intervals at the displayed precision.
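A quick sketch with simulated data (any two equal-sized samples will do) shows the t statistics coinciding while only the degrees of freedom differ:
# with n1 = n2, the pooled and Welch t statistics are identical
set.seed(1)
x <- rnorm(12)
y <- rnorm(12)
t.test(x, y, var.equal = TRUE)$statistic  # pooled t
t.test(x, y)$statistic                    # Welch t: same value
t.test(x, y, var.equal = TRUE)$parameter  # df = 22
t.test(x, y)$parameter                    # Welch df: slightly smaller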
But, by the way, in your swim example, the samples are paired, so you should really do
with(swim, t.test(wet.suit.velocity, swim.suit.velocity,
                  alternative = "two.sided", paired = TRUE))
Paired t-test
data: wet.suit.velocity and swim.suit.velocity
t = 12.318, df = 11, p-value = 8.885e-08
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
0.06365244 0.09134756
sample estimates:
mean difference
0.0775

Why does survey::svyranktest() calculate a different p-value to coin::wilcox_test() and stats::wilcox.test()?

I intend to use survey::svyranktest() to run a weighted Wilcoxon rank-sum test, as I don't believe either coin::wilcox_test() or stats::wilcox.test() allows for weighted distributions.
To ensure it is working as expected, I compared outputs on the small unweighted sample data frame defined below.
I expected parts of each output to differ, e.g. svyranktest() outputs the H statistic from the Kruskal-Wallis test, while wilcox.test() outputs the W value.
However, I expected the p-values to match. They are similar, but the p-value from svyranktest() differs from the others, and I want to understand why before I proceed to use this function.
library(tibble)
sample_df <- tibble(Gender = c("female","female","female","female","female","female","male","male","male","male","male"),
                    Reactiontime = c(34,36,41,43,44,37,45,33,35,39,42),
                    Rank = c(2,4,7,9,10,5,11,1,3,6,8))
I then ran each test; here is the code I used and the output obtained from each.
wilcox.test(Reactiontime ~ Gender , data=sample_df, exact=F, correct=F)
Wilcoxon rank sum test
data: Reactiontime by Gender
W = 16, p-value = 0.8551
alternative hypothesis: true location shift is not equal to 0
coin::wilcox_test(Reactiontime ~ as.factor(Gender), data=sample_df)
Asymptotic Wilcoxon-Mann-Whitney Test
data: Reactiontime by as.factor(Gender) (female, male)
Z = 0.18257, p-value = 0.8551
alternative hypothesis: true mu is not equal to 0
design_test <- survey::svydesign(ids = ~0, data = sample_df)
mw_test <- survey::svyranktest(formula = Reactiontime ~ Gender , design_test, test = "wilcoxon")
Design-based KruskalWallis test
data: Reactiontime ~ Gender
t = -0.17904, df = 9, p-value = 0.8619
alternative hypothesis: true difference in mean rank score is not equal to 0
sample estimates:
difference in mean rank score
-0.03333333

How do I compute point-by-point t tests between two 50 data point vectors?

I have a data frame with 3 variables and 50 instances (ID, pre, and post), somewhat like this:
ID<- c("1","2","3","4","5","6","7","8","9","10")
pre<- c("2.56802","2.6686","1.0145","0.2568","2.369","1.2365","0.6989","0.98745","1.09878","2.454658")
post<-c("3.3323","2.66989","1.565656","2.58989","5.96987","3.12145","1.23565","2.74741","2.54101","0.23568")
dfw1<-data.frame(ID,pre,post)
The pre and post columns are means from another population. I want to run a two-tailed t-test between pre and post for the first row, and then loop that over all 50 rows. I have tried writing a loop as shown below,
t<-0
for (i in 1:nrow(dfw$ID)) {
t[i]<-t.test(dfw$pre,dfw$post,alternative = c("two.sided"), conf.level = 0.95)
print(t)
}
but it returned an error.
I want to extract the statistics from each test, such as the df, p-value, and t-value for each row. How do I write this code in R?
This code shows that you cannot reject the null hypothesis of zero difference at the conventional 5% significance level:
ID<- c("1","2","3","4","5","6","7","8","9","10")
pre<- as.numeric(c("2.56802","2.6686","1.0145","0.2568","2.369","1.2365","0.6989","0.98745","1.09878","2.454658"))
post<-as.numeric(c("3.3323","2.66989","1.565656","2.58989","5.96987","3.12145","1.23565","2.74741","2.54101","0.23568"))
dfw1<-data.frame(ID,pre,post)
t.test(dfw1$pre,dfw1$post,alternative = c("two.sided"), conf.level = 0.95, paired=TRUE)
Output (giving you the df, t-stat and p-value):
Paired t-test
data: dfw1$pre and dfw1$post
t = -2.1608, df = 9, p-value = 0.05899
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-2.18109315 0.04997355
sample estimates:
mean of the differences
-1.06556
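To extract the individual statistics you asked about: t.test returns a list (class "htest"), so the df, t-value, and p-value can be pulled out by name. A minimal sketch:
tt <- t.test(dfw1$pre, dfw1$post, alternative = "two.sided", conf.level = 0.95, paired = TRUE)
tt$statistic  # t-value
tt$parameter  # degrees of freedom
tt$p.value    # p-value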

Confidence Interval for a Chi-Square in R

How do I calculate a confidence interval for a chi-square test in R? Is there a function like chisq.test()?
There is no confidence interval for a chi-square test (you're just checking whether the two categorical variables are independent), but you can compute a confidence interval for the difference in proportions, like this.
Say you have some data where 30% of the first group report success, while 70% of a second group report success:
row1 <- c(70,30)
row2 <- c(30,70)
my.table <- rbind(row1,row2)
Now you have the data in a contingency table:
> my.table
[,1] [,2]
row1 70 30
row2 30 70
You can run chisq.test on it, and clearly those two proportions are significantly different, so the categorical variables are not independent:
> chisq.test(my.table)
Pearson's Chi-squared test with Yates' continuity correction
data: my.table
X-squared = 30.42, df = 1, p-value = 3.479e-08
If you do prop.test you find that you are 95% confident the difference between the proportions is somewhere between 26.29% and 53.70%, which makes sense, because the actual difference between the two observed proportions is 70%-30%=40%:
> prop.test(x=c(70,30),n=c(100,100))
2-sample test for equality of proportions with continuity correction
data: c(70, 30) out of c(100, 100)
X-squared = 30.42, df = 1, p-value = 3.479e-08
alternative hypothesis: two.sided
95 percent confidence interval:
0.2629798 0.5370202
sample estimates:
prop 1 prop 2
0.7 0.3
An addition to #mysteRious' nice answer: if you have a 2x2 contingency matrix, you could use fisher.test instead of prop.test to base the inference on the odds ratio (OR) rather than the difference in proportions. In Fisher's exact test the null hypothesis corresponds to OR = 1.
Using #mysteRious' sample data
ft <- fisher.test(my.table)
ft
#
# Fisher's Exact Test for Count Data
#
#data: my.table
#p-value = 2.31e-08
#alternative hypothesis: true odds ratio is not equal to 1
#95 percent confidence interval:
# 2.851947 10.440153
#sample estimates:
#odds ratio
# 5.392849
Confidence intervals for the OR are then given in ft$conf.int:
ft$conf.int
#[1] 2.851947 10.440153
#attr(,"conf.level")
#[1] 0.95
To confirm, we manually calculate the sample OR:
OR <- Reduce("/", my.table[, 1] / my.table[, 2])  # (70/30) / (30/70)
OR
#[1] 5.444444
Note that fisher.test reports the conditional maximum-likelihood estimate of the OR rather than the sample odds ratio, which is why the two values (5.392849 vs. 5.444444) differ slightly.

t test in r giving wrong estimate of mean vs aggregate function

totaldata$Age2 <- ifelse(totaldata$Age <= 50, 0, 1)
t.test(totaldata$concernsubscorehiv, totaldata$Age2, alternative = 'two.sided',
       na.rm = TRUE, conf.level = .95, paired = FALSE)
This code yields this result:
Welch Two Sample t-test
data: totaldata$concernsubscorehiv and totaldata$Age2
t = 33.19, df = 127.42, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
3.370758 3.798164
sample estimates:
mean of x mean of y
4.336842 0.752381
As you can see, the mean of group y is 0.752381.
Then I estimate the mean of each group using this:
aggregate(totaldata$concernsubscorehiv~totaldata$Age2,data=totaldata,mean)
This yields
totaldata$Age2 totaldata$concernsubscorehiv
1 0 4.354286
2 1 4.330612
As you can see, the mean of group 0 is 4.354286, not 0.752381 as estimated by the t-test. What is the problem?
You aren't using t.test correctly. 0.752381 is the fraction of people for whom Age2 is 1. You are supplying a vector of your actual data and a vector of zeros and ones, when instead you want to split the first vector by the grouping in the second.
Consider the following:
out <- rnorm(10)*5+100
bin <- rbinom(n=10, size=1, prob=0.5)
mean(out)
[1] 101.9462
mean(bin)
[1] 0.4
From the ?t.test helpfile, we know that x and y are:
x a (non-empty) numeric vector of data values.
y an optional (non-empty) numeric vector of data values.
So, by supplying both out and bin, I compare the two vectors to each other, which probably does not make much sense in this example. See:
t.test(out, bin)
Welch Two Sample t-test
data: out and bin
t = 86.665, df = 9.3564, p-value = 6.521e-15
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
98.91092 104.18149
sample estimates:
mean of x mean of y
101.9462 0.4000
Here, you see that t.test correctly estimated the means of the two supplied vectors, as shown above. What you want instead is to split the first vector by whether the second is 0 or 1.
In my toy example, I can do this easily by writing:
t.test(out[which(bin==1)], out[which(bin==0)])
Welch Two Sample t-test
data: out[which(bin == 1)] and out[which(bin == 0)]
t = 0.34943, df = 5.1963, p-value = 0.7405
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-5.828182 7.686176
sample estimates:
mean of x mean of y
102.5036 101.5746
Here, these two means correspond exactly to
tapply(out, bin, mean)
0 1
101.5746 102.5036
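As a side note, the formula interface does this split for you; a sketch using the same toy data, and the same idea applied to the original question's variables:
t.test(out ~ bin)  # equivalent to splitting out by the levels of bin
t.test(concernsubscorehiv ~ Age2, data = totaldata)  # for the original data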
