t.test in R giving wrong estimate of mean vs aggregate function

totaldata$Age2 <- ifelse(totaldata$Age<=50, 0, 1)
t.test(totaldata$concernsubscorehiv, totaldata$Age2, alternative='two.sided', na.rm=TRUE, conf.level=.95, paired=FALSE)
This code yields this result:
Welch Two Sample t-test
data: totaldata$concernsubscorehiv and totaldata$Age2
t = 33.19, df = 127.42, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
3.370758 3.798164
sample estimates:
mean of x mean of y
4.336842 0.752381
As you can see, the mean of group y is 0.752381.
Then, when I estimate the mean of each group using this:
aggregate(totaldata$concernsubscorehiv~totaldata$Age2,data=totaldata,mean)
This yields
totaldata$Age2 totaldata$concernsubscorehiv
1 0 4.354286
2 1 4.330612
As you can see, the mean of group 0 is 4.354286, not 0.752381 as estimated by the t-test. What is the problem?

You are not using t.test correctly. 0.752381 is simply the mean of Age2, i.e. the fraction of people for whom Age2 is 1. You are supplying a vector of your outcome data and a vector of zeros and ones, when what you want is to split the first vector into two groups based on the grouping in the second.
Consider the following:
out <- rnorm(10)*5+100
bin <- rbinom(n=10, size=1, prob=0.5)
mean(out)
[1] 101.9462
mean(bin)
[1] 0.4
From the ?t.test helpfile, we know that x and y are:
x a (non-empty) numeric vector of data values.
y an optional (non-empty) numeric vector of data values.
So, by supplying both out and bin, I compare those two vectors to each other, which does not make much sense in this example. See:
t.test(out, bin)
Welch Two Sample t-test
data: out and bin
t = 86.665, df = 9.3564, p-value = 6.521e-15
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
98.91092 104.18149
sample estimates:
mean of x mean of y
101.9462 0.4000
Here, you see that t.test correctly estimated the means for my two supplied vectors, as I have shown above. What you want to do is to split up the first vector based on whether or not the second is 0 or 1.
In my toy example, I can do this easily by writing:
t.test(out[which(bin==1)], out[which(bin==0)])
Welch Two Sample t-test
data: out[which(bin == 1)] and out[which(bin == 0)]
t = 0.34943, df = 5.1963, p-value = 0.7405
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-5.828182 7.686176
sample estimates:
mean of x mean of y
102.5036 101.5746
Here, these two means correspond exactly to
tapply(out, bin, mean)
0 1
101.5746 102.5036
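Applied to the original question, you can also let the formula interface do this split for you instead of subsetting by hand (a sketch, assuming the concernsubscorehiv and Age2 columns from your question):

t.test(concernsubscorehiv ~ Age2, data = totaldata)

This compares the mean of concernsubscorehiv between the Age2 == 0 and Age2 == 1 groups, and the two sample estimates it reports should match the group means from your aggregate() call.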


When doing a t-test with t.test() while piping, what does the period do in "data=."?

I have a data frame with two columns, age and sex. I'm doing a statistical analysis to determine if there's a difference in the age distribution between the two sex groups. I know that if I don't supply data= it gives an error (I believe it's something related to the dplyr library). I was wondering what the single . in the data argument does. Does it refer to the data frame we used before the %>%?
age_sex.htest <- d %>%
  t.test(formula = age ~ sex, data = .)
As @markus has pointed out, d is passed to the data argument of t.test. Here is the output for data(sleep) using the . placeholder.
library(dplyr)
data(sleep)
sleep %>% t.test(formula=extra ~ group, data = .)
# Output
Welch Two Sample t-test
data: extra by group
t = -1.8608, df = 17.776, p-value = 0.07939
alternative hypothesis: true difference in means between group 1 and group 2 is not equal to 0
95 percent confidence interval:
-3.3654832 0.2054832
sample estimates:
mean in group 1 mean in group 2
0.75 2.33
If you put sleep directly into the data argument of t.test, you will get the same result, as t.test is running on the exact same data.
t.test(formula=extra ~ group, data = sleep)
# Output
Welch Two Sample t-test
data: extra by group
t = -1.8608, df = 17.776, p-value = 0.07939
alternative hypothesis: true difference in means between group 1 and group 2 is not equal to 0
95 percent confidence interval:
-3.3654832 0.2054832
sample estimates:
mean in group 1 mean in group 2
0.75 2.33
In this case, the . is not that beneficial, though some people prefer this stylistically (I generally do).
However, it is extremely useful when you want to run the analysis on a slight alteration of the dataframe. So, with the sleep dataset, for example, if you wanted to remove ID == 10 from both groups, then you could remove those with filter, and then run the t.test.
sleep %>%
  filter(ID != 10) %>%
  t.test(formula = extra ~ group, data = .)
So we pass an altered version of the sleep dataset, without the rows where ID is 10, and now we see a change in the output:
Welch Two Sample t-test
data: extra by group
t = -1.7259, df = 15.754, p-value = 0.1039
alternative hypothesis: true difference in means between group 1 and group 2 is not equal to 0
95 percent confidence interval:
-3.5677509 0.3677509
sample estimates:
mean in group 1 mean in group 2
0.6111111 2.2111111
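If it helps to see what the . stands for, the piped call above is equivalent to building the filtered data frame explicitly and passing it to data= yourself (sleep_no10 is just an illustrative name):

library(dplyr)
data(sleep)
sleep_no10 <- filter(sleep, ID != 10)  # same rows as the piped filter() step
t.test(extra ~ group, data = sleep_no10)

The . is simply the placeholder magrittr uses for whatever came out of the previous step of the pipe.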

Survey-weighted t-test not showing correct difference in means in R?

I am doing an analysis with complex survey data in R. However, when I use svyttest from the survey package to perform a design-based t-test, it is not giving the correct difference in means:
svyby(~preds,~SDDSRVYR,svymean, design=subset(data, age==2))
SDDSRVYR preds se
7 7 0.2340050 0.01161363
10 10 0.3159294 0.01076532
tt<-svyttest(preds~SDDSRVYR, design=subset(data, age==2))
> tt
Design-based t-test
data: preds ~ SDDSRVYR
t = 5.1734, df = 30, p-value = 1.428e-05
alternative hypothesis: true difference in mean is not equal to 0
95 percent confidence interval:
0.01696236 0.03765392
sample estimates:
difference in mean
0.02730814
As you can see, the difference in means is about 0.082, but the t-test is showing it as about 0.03. Am I misunderstanding how the t-test calculates the means? I can't imagine it would be any different from svymean... Or perhaps this is a coding issue?
I found the answer: SDDSRVYR was being treated as continuous (it takes the values 7 and 10), not as a binary grouping variable. With a numeric predictor, the reported estimate is a slope, i.e. the change in mean per one-unit increase in SDDSRVYR, so it has to be multiplied by the 3-unit gap between 7 and 10 to get the difference between the two groups. Treating SDDSRVYR as a factor gives the group difference directly:
svyttest(preds~factor(SDDSRVYR), design=subset(data, age==2))
Design-based t-test
data: preds ~ factor(SDDSRVYR)
t = 5.1734, df = 30, p-value = 1.428e-05
alternative hypothesis: true difference in mean is not equal to 0
95 percent confidence interval:
0.05088707 0.11296176
sample estimates:
difference in mean
0.08192442
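A quick sanity check of the slope interpretation, using the estimates printed above: the two groups (7 and 10) are 3 units apart, so

0.02730814 * (10 - 7)
# [1] 0.08192442

which matches both the factor(SDDSRVYR) result and the gap between the two svymean estimates (0.3159294 - 0.2340050 = 0.0819244).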

T.test on a small set of data from a large set

We compare the average (or mean) of one group against a set average (or mean). This set average can be any theoretical value (or it can be the population mean).
I am trying to compare the mean of a small group of 300 observations against a set of 1500 observations using a one-sided t.test. Is this approach correct? If not, is there an alternative?
head(data$BMI)
attach(data)
tester<-mean(BMI)
table(BMI)
set.seed(123)
sampler<-sample(min(BMI):max(BMI),300,replace = TRUE)
mean(sampler)
t.test(sampler,tester)
The last line of the code yields:
Error in t.test.default(sampler, tester) : not enough 'y' observations
To test your sample with t.test, you can do:
d <- rnorm(1500,mean = 3, sd = 1)
s <- sample(d,300)
Then, test for the normality of d and s:
> shapiro.test(d)
Shapiro-Wilk normality test
data: d
W = 0.9993, p-value = 0.8734
> shapiro.test(s)
Shapiro-Wilk normality test
data: s
W = 0.99202, p-value = 0.1065
Here the p-values are above 0.05, so you can consider both d and s to be approximately normally distributed, and you can run the t-test:
> t.test(d,s)
Welch Two Sample t-test
data: d and s
t = 0.32389, df = 444.25, p-value = 0.7462
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.09790144 0.13653776
sample estimates:
mean of x mean of y
2.969257 2.949939
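Note that the error in the original code comes from passing a single number (tester, the overall mean) as the second sample, so t.test has only one 'y' observation. If the goal is to compare the 300-observation sample against the overall mean treated as a fixed reference value, a one-sample test is an alternative (a sketch reusing the sampler and tester objects from the question):

# one-sample t-test of the subsample against the overall mean
t.test(sampler, mu = tester)
# use alternative = "greater" or "less" for a one-sided version

Here mu= supplies the hypothesized mean, so no second sample is needed.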

t.test error using ddply; count matches yet says not many observations [duplicate]

I got this error for my code
Error in t.test.default(dat$Value, x[[i]][[2]]) :
not enough 'y' observations
I think the reason I got this error is that I'm doing a t.test on data that only has one value (so there wouldn't be a mean or an sd) versus data that has 20 values. Is there a way I can get around this, or to ignore the data that doesn't have enough y observations? Something like an if statement might work... please help.
So my code that does the t.test is
t<- lapply(1:length(x), function(i) t.test(dat$Value,x[[i]][[2]]))
where x is data in the form of cuts similar to
cut: [3:8)
Number Value
3 15 8
4 16 7
5 17 6
6 19 2.3
this data goes on
cut:[9:14)
Number Value
7 21 15
cut:[13:18) etc
Number Value
8 22 49
9 23 15
10 24 13
How can I ignore 'cuts' that have only one value in them, like the example above where 'cut:[9:14)' has only one value?
All standard variants of the t-test use sample variances in their formulas, and you cannot compute a sample variance from a single observation, because you are dividing by n - 1, where n is the sample size.
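You can see this directly in R: for a single observation the sample variance (and hence the standard deviation) is undefined and comes back as NA:

var(5)  # NA, because the denominator n - 1 is 0
sd(5)   # NA for the same reason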
This would probably be the easiest modification, although I cannot test it as you did not provide sample data (you could dput your data to your question):
t <- lapply(1:length(x), function(i){
  if(length(x[[i]][[2]]) > 1){
    t.test(dat$Value, x[[i]][[2]])
  } else "Only one observation in subset" # or NA or something else
})
Another option would be to modify the indices which are used in lapply:
ind<-which(sapply(x,function(i) length(i[[2]])>1))
t<- lapply(ind, function(i) t.test(dat$Value,x[[i]][[2]]))
Here's an example of the first case with artificial data:
x <- list(a=cbind(1:5, rnorm(5)), b=cbind(1, rnorm(1)), c=cbind(1:3, rnorm(3)))
y <- rnorm(20)
t <- lapply(1:length(x), function(i){
  if(length(x[[i]][,2]) > 1){ # note the indexing x[[i]][,2]
    t.test(y, x[[i]][,2])
  } else "Only one observation in subset"
})
t
[[1]]
Welch Two Sample t-test
data: y and x[[i]][, 2]
t = -0.4695, df = 16.019, p-value = 0.645
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-1.2143180 0.7739393
sample estimates:
mean of x mean of y
0.1863028 0.4064921
[[2]]
[1] "Only one observation in subset"
[[3]]
Welch Two Sample t-test
data: y and x[[i]][, 2]
t = -0.6213, df = 3.081, p-value = 0.5774
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-3.013287 2.016666
sample estimates:
mean of x mean of y
0.1863028 0.6846135
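A third option in the same spirit is to drop the short subsets up front with Filter() and run the tests only on what remains (a sketch using the same artificial x and y as above):

# keep only the list elements whose second column has more than one value
x_ok <- Filter(function(el) length(el[, 2]) > 1, x)
t_ok <- lapply(x_ok, function(el) t.test(y, el[, 2]))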

Why am I getting a different Cohen's d using the formula and using a function in R?

I'm comparing X (a two-category variable) and Y (let's say age). First, I did t.test(Y ~ X), which returned:
Welch Two Sample t-test
data: Y by X
t = 1.0579, df = 680.889, p-value = 0.2905
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-1.101452 3.674986
sample estimates:
mean in group 0 mean in group 1
47.07576 45.78899
Next, I used the cohensD(X, Y) function from the lsr package. It returned 4.017.
To check that, I used a Cohen's d calculator, which uses the equation d = 2t/sqrt(df) = 2 * 1.0579 / sqrt(680.889) = 0.081.
Why am I getting 0.081 using the formula and 4.017 using lsr?
