Is it possible to change Type I error threshold using t.test() function? - r

I am asked to compute a test statistic using the t.test() function, but I need to reduce the type I error. My prof showed us how to change a confidence level for this function, but not the acceptable type I error for null hypothesis testing. The goal is for the argument to automatically compute a p-value based on a .01 error rate rather than the normal .05.
The r code below involves a data set that I have downloaded.
t.test(mid$log_radius_area, mu=8.456)

I feel like I've answered this somewhere, but can't seem to find it on SO or CrossValidated.
Similarly to this question, the answer is that t.test() doesn't specify any threshold for rejecting/failing to reject the null hypothesis; it reports a p-value, and you get to decide whether to reject or not. (The conf.level argument is for adjusting which confidence interval the output reports.)
From ?t.test:
t.test(1:10, y = c(7:20))
Welch Two Sample t-test
data: 1:10 and c(7:20)
t = -5.4349, df = 21.982, p-value = 1.855e-05
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-11.052802 -4.947198
sample estimates:
mean of x mean of y
5.5 13.5
Here the p-value is reported as 1.855e-05, so the null hypothesis would be rejected for any (pre-specified) alpha level >1.855e-05. Note that the output doesn't say anywhere "the null hypothesis is rejected at alpha=0.05" or anything like that. You could write your own function to do that, using the $p.value element that is saved as part of the test results:
report_test <- function(tt, alpha=0.05) {
  cat("the null hypothesis is ")
  if (tt$p.value > alpha) {
    cat("**NOT** ")
  }
  cat("rejected at alpha=",alpha,"\n")
}
tt <- t.test(1:10, y = c(7:20))
report_test(tt)
## the null hypothesis is rejected at alpha= 0.05
Most R package/function writers don't bother to do this, because they figure that it should be simple enough for users to do for themselves.
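For the 0.01 threshold in the question, a minimal sketch is shown below. The data are simulated as a stand-in for mid$log_radius_area (which isn't available here), so the numbers are illustrative only. It shows that conf.level changes the reported interval but leaves the p-value untouched, and that the decision at alpha = 0.01 is just a comparison you make yourself.

```r
# Hypothetical stand-in for mid$log_radius_area (simulated, not the OP's data)
set.seed(1)
x <- rnorm(30, mean = 8.6, sd = 0.3)

tt95 <- t.test(x, mu = 8.456)                     # default 95% interval
tt99 <- t.test(x, mu = 8.456, conf.level = 0.99)  # 99% interval

# conf.level only affects the interval; the p-value is the same in both
same_p <- isTRUE(all.equal(tt95$p.value, tt99$p.value))

# The decision at a 1% type I error rate is made by the user
alpha  <- 0.01
reject <- tt99$p.value < alpha

# Equivalent check: is the hypothesised mean outside the 99% interval?
outside_ci <- 8.456 < tt99$conf.int[1] || 8.456 > tt99$conf.int[2]
```

Here reject and outside_ci always agree, because the two-sided test at level alpha and the (1 - alpha) confidence interval are two views of the same calculation.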

Related

How do you set the alternative hypothesis text string in a `htest` object in R?

I am creating a hypothesis test in R as a htest object. I have managed to create the object I want, with the required estimate, test statistic and p-value. My only remaining problem is that the statement I want to give for my alternative hypothesis does not conform to the textual structure used in the printing method for an htest object. The setup for these objects seems to assume you have an alternative hypothesis that is a one-sided or two-sided test operating on an unknown parameter. It does not appear to accommodate more general statements of alternative hypotheses, such as for goodness-of-fit tests. To be a bit more specific about my problem, here is the textual structure of the output print message for a htest object:
alternative hypothesis: true [name(null.value)] is [less than/equal to/greater than] [null.value]
I would like a more general print output like this:
alternative hypothesis: [character string]
When you create a htest object you can set name(null.value) and null.value to any character string you want, so it is possible to alter the start and end parts of the print message to anything you want. You can also set alternative to NA and this removes the middle part. However, the intermediate words "true" and "is" seem to be fixed. This means that you seem to be stuck with a message with the structure true [character string] is [character string].
My question: When creating a htest object, is there any way to get a print message for the alternative hypothesis that is an arbitrary character string? If so, what is the simplest way to do this?
As long as you set x$null.value <- NULL, it will print any string you construct for x$alternative
x <- t.test(1:10)
x$null.value <- NULL
x$alternative <- sprintf('%.2f on %s degrees of freedom, p %s',
x$statistic, x$parameter, format.pval(x$p.value, eps = 0.001))
x
# One Sample t-test
#
# data: 1:10
# t = 5.7446, df = 9, p-value = 0.0002782
# alternative hypothesis: 5.74 on 9 degrees of freedom, p < 0.001
# 95 percent confidence interval:
# 3.334149 7.665851
# sample estimates:
# mean of x
# 5.5
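The same approach should work for an htest object built from scratch, which is closer to the goodness-of-fit case in the question. The sketch below is only an illustration: the statistic, p-value, method name and wording are all made up, and the object simply omits null.value so that the print method falls through to printing the alternative string as-is.

```r
# Hand-built htest object; all values here are hypothetical placeholders
gof <- structure(
  list(
    statistic   = c(X = 3.2),
    p.value     = 0.07,
    alternative = "the data do not follow the hypothesised distribution",
    method      = "Hypothetical goodness-of-fit test",
    data.name   = "example data"
    # no null.value element, so print.htest prints `alternative` verbatim
  ),
  class = "htest"
)

# Capture the printed lines so the alternative-hypothesis line can be inspected
out <- capture.output(print(gof))
alt_line <- grep("^alternative hypothesis:", out, value = TRUE)
alt_line
```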

test for difference of means returns wrong result

I'm running the example of the R-intro manual:
A = c(79.98, 80.04, 80.02, 80.04, 80.03, 80.03, 80.04, 79.97, 80.05, 80.03, 80.02, 80.00, 80.02)
B = c(80.02, 79.94, 79.98, 79.97, 79.97, 80.03, 79.95, 79.97)
t.test(A, B)
Which produces the following result:
Welch Two Sample t-test
data: A and B
t = 3.2499, df = 12.027, p-value = 0.006939
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
0.01385526 0.07018320
sample estimates:
mean of x mean of y
80.02077 79.97875
The question is: if the difference of means is contained within the confidence interval (80.02077-79.97875=0.04202 and 0.01385526<0.04202<0.07018320) why does it conclude that the alternative hypothesis is true and not that the null hypothesis is true?
I think this is a language/interpretation problem. You are interpreting
alternative hypothesis: true difference in means is not equal to 0
as
The alternative hypothesis is true. The difference in means is not equal to 0
rather than (as intended)
The alternative hypothesis is: "the true difference in means is not equal to 0"
(According to strict frequentist logic we would never conclude "the alternative hypothesis is true", only that we can reject the null hypothesis.)
In order to evaluate the conclusions of the test, you should look at the 95% confidence interval (0.01385526, 0.07018320) and/or the p-value (0.0069). The procedure implemented in R does not follow a "Neyman-Pearson" style where you pre-specify an alpha level and dichotomize the result into "reject null hypothesis" or "fail to reject null hypothesis". If you want to do this, you can either just look at the p-value or, if you want R to do it for you,
alpha <- 0.05 ## or whatever your preferred cutoff is
t_result <- t.test(A,B)
t_result$p.value<alpha ## TRUE (reject null hypothesis)
Furthermore, your interpretation of the confidence interval is wrong. You should look to see whether the confidence interval includes zero; it will always be centred on the observed difference (so the observed difference will always be included in the 95% CI).
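A quick check of both points with the data from the question (the variable names res, ci and obs_diff are just for this sketch): the interval is centred on the observed difference, and the decision at alpha = 0.05 matches whether zero falls inside it.

```r
A <- c(79.98, 80.04, 80.02, 80.04, 80.03, 80.03, 80.04, 79.97, 80.05, 80.03, 80.02, 80.00, 80.02)
B <- c(80.02, 79.94, 79.98, 79.97, 79.97, 80.03, 79.95, 79.97)

res      <- t.test(A, B)
ci       <- res$conf.int
obs_diff <- mean(A) - mean(B)

# The midpoint of the interval is the observed difference in means
centred <- isTRUE(all.equal(mean(ci), obs_diff))

# What matters for the test is whether the interval contains zero
zero_in_ci <- ci[1] <= 0 && 0 <= ci[2]
reject     <- res$p.value < 0.05
```

With these data zero_in_ci is FALSE and reject is TRUE, which is the same conclusion expressed two ways.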

R - Error T-test For loop command between variables

I'm currently writing a for loop that will calculate and print t-test results. I'm testing for the difference in means of all variables (faminc, fatheduc, motheduc, white, cigtax, cigprice) between smokers and non-smokers ("smoke"; 0=non, 1=smoker).
Current code:
type <- c("faminc", "fatheduc", "motheduc", "white", "cigtax", "cigprice")
count <- 1
for(name in type){
temp <- subset(data, data[name]==1)
cat("For", name, "between smokers and non, the difference in means is: \n")
print(t.test(temp$smoke))
count <- count + 1
}
However, I feel that 'temp' doesn't belong here and when running the code I get:
For faminc between smokers and non, the difference in means is:
Error in t.test.default(temp$smoke) : not enough 'x' observations
The simple code of
t.test(faminc~smoke,data=data)
does what I need, but I'd like to get some practice/better understanding of for loops.
Here is a solution that generates the output requested in the OP, using lapply() with the mtcars data set.
data(mtcars)
varList <- c("wt","disp","mpg")
results <- lapply(varList,function(x){
t.test(mtcars[[x]] ~ mtcars$am)
})
names(results) <- varList
for(i in 1:length(results)){
message(paste("for variable:",names(results[i]),"difference between manual and automatic transmissions is:"))
print(results[[i]])
}
...and the output:
> for(i in 1:length(results)){
+ message(paste("for variable:",names(results[i]),"difference between manual and automatic transmissions is:"))
+ print(results[[i]])
+ }
for variable: wt difference between manual and automatic transmissions is:
Welch Two Sample t-test
data: mtcars[[x]] by mtcars$am
t = 5.4939, df = 29.234, p-value = 6.272e-06
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
0.8525632 1.8632262
sample estimates:
mean in group 0 mean in group 1
3.768895 2.411000
for variable: disp difference between manual and automatic transmissions is:
Welch Two Sample t-test
data: mtcars[[x]] by mtcars$am
t = 4.1977, df = 29.258, p-value = 0.00023
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
75.32779 218.36857
sample estimates:
mean in group 0 mean in group 1
290.3789 143.5308
for variable: mpg difference between manual and automatic transmissions is:
Welch Two Sample t-test
data: mtcars[[x]] by mtcars$am
t = -3.7671, df = 18.332, p-value = 0.001374
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-11.280194 -3.209684
sample estimates:
mean in group 0 mean in group 1
17.14737 24.39231
>
Compare your code that works...
t.test(faminc~smoke,data=data)
You are specifying a relationship between variables (faminc~smoke) which means that you think the mean of faminc is different between the values of smoke and you wish to use the data dataset.
The equivalent line in your loop...
print(t.test(temp$smoke))
...only gives the single column of temp$smoke after having selected those who have the value 1 for each of faminc, fatheduc etc. So even if you wrote...
print(t.test(faminc~smoke, data=data))
Further your count is doing nothing.
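If you want to perform a range of tests in this manner, you could build each formula from the variable name (a formula written as name~smoke would look for a column literally called "name"):
type <- c("faminc", "fatheduc", "motheduc", "white", "cigtax", "cigprice")
for(name in type){
cat("For", name, "between smokers and non, the difference in means is: \n")
# build e.g. faminc ~ smoke from the string held in `name`
print(t.test(as.formula(paste(name, "~ smoke")), data=data))
}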
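Because a formula does not substitute the value of a character variable, the formula has to be constructed from the string. Here is a runnable sketch of that idea on mtcars, used as a stand-in since the OP's data isn't available; am plays the role of smoke, and the variable list is arbitrary.

```r
# Stand-in example: compare several mtcars variables between am = 0 and am = 1
vars <- c("wt", "disp", "mpg")

pvals <- sapply(vars, function(v) {
  f <- as.formula(paste(v, "~ am"))   # e.g. wt ~ am
  t.test(f, data = mtcars)$p.value    # Welch two-sample t-test
})

# Named vector of p-values, one per variable
signif(pvals, 4)
```

Collecting the p-values (or the whole test objects) in a vector or list is usually easier to work with afterwards than printing inside the loop.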
Whether this is what you want to do though isn't clear to me, your variables suggest family (faminc), father (fatheduc), mother (motheduc), ethnicity (white), tax (cigtax) and price (cigprice).
I can't think why you would want to compare the mean cigarette price or tax between smokers and non-smokers, because the latter are not going to have any value for this since they don't smoke!
Your code suggests these are perhaps binary variables though (since you are filtering on each value being 1), which to me suggests this isn't even what you want to do.
If you wish to look in subsets of data then a tidier approach to performing regression rather than loops is to use purrr.
In future, when asking, consider providing a sample of data along with the full copy & pasted output, as advised in How to create a Minimal, Complete, and Verifiable example - Help Center - Stack Overflow, because this allows people to see in greater detail what you are doing (e.g. I've only guessed about your variables). With statistics it's also useful to state what your hypothesis is, to help people understand what you are trying to achieve overall.

Bootstrapping sample means in R using boot Package, Creating the Statistic Function for boot() Function

I have a data set with 15 density calculations, each from a different transect. I would like to resample these with replacement, taking 15 randomly selected samples of the 15 transects and then getting the mean of these resamples. Each transect should have its own personal probability of being sampled during this process. This should be done 5000 times. I have code which does this without using the boot function, but if I want to calculate the BCa 95% CI using the boot package it requires the bootstrapping to be done through the boot function first.
I have been trying to create a function but I can't get any that seem to work. I want the bootstrap to select from a certain column (data$xs) and the probabilities to be used are in the column data$prob.
The function I thought might work was;
library(boot)
meanfun <- function (data, i){
d<-data [i,]
return (mean (d)) }
bo<-boot (data$xs, statistic=meanfun, R=5000)
#boot.ci (bo, conf=0.95, type="bca") #obviously `bo` was not made
But this told me 'incorrect number of dimensions'
I understand how to make a function in the normal sense but it seems strange how the function works in boot. Since the function is only given to boot by name, and no specification of the arguments to pass into the function I seem limited to what boot itself will pass in as arguments (for example I am unable to pass data$xs in as the argument for data, and I am unable to pass in data$prob as an argument for probability, and so on). It seems to really limit what can be done. Perhaps I am missing something though?
Thanks for any and all help
The reason for this error is that data$xs returns a vector, which you then try to subset with data[i, ] as if it had two dimensions.
One way to solve this is to change the subsetting to data[i], or to pass data[, "xs", drop = FALSE] to boot() instead. The drop = FALSE avoids type coercion, i.e. keeps it as a data.frame.
We try
data <- data.frame(xs = rnorm(15, 2))
library(boot)
meanfun <- function(data, i){
d <- data[i, ]
return(mean(d))
}
bo <- boot(data[, "xs", drop = FALSE], statistic=meanfun, R=5000)
boot.ci(bo, conf=0.95, type="bca")
and obtain:
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 5000 bootstrap replicates
CALL :
boot.ci(boot.out = bo, conf = 0.95, type = "bca")
Intervals :
Level BCa
95% ( 1.555, 2.534 )
Calculations and Intervals on Original Scale
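The other fix mentioned above, keeping the data as a plain vector and subsetting with x[i], can be sketched as follows. The data are simulated, R is reduced to 2000 to keep it quick, and the names xs, meanfun_vec and bo_vec are made up for this example; it does not address the unequal sampling probabilities from the question.

```r
library(boot)

set.seed(42)
xs <- rnorm(15, mean = 2)          # simulated stand-in for data$xs

# boot() calls the statistic with the data and a vector of resampled indices
meanfun_vec <- function(x, i) mean(x[i])

bo_vec <- boot(xs, statistic = meanfun_vec, R = 2000)
ci_vec <- boot.ci(bo_vec, conf = 0.95, type = "bca")

# lower and upper BCa limits are the 4th and 5th entries of ci_vec$bca
ci_vec$bca[4:5]
```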
One can use boot.array to extract all or a subset of the resampled sets. In this case:
bo.ci<-boot.ci(boot.out = bo, conf = 0.95, type = "bca")
resampled.data<-boot.array(bo,1)
To extract the first and second sets of resampled data:
resample.1<-resampled.data[1,]
resample.2<-resampled.data[2,]
Then proceed to extract the individual statistic you'd want from any subset. For instance, if you assume normality you could run a Student's t-test on the first subset:
t.test(resample.1)
Which for this example and particular seed value(s) gives:
data: resample.1
t = 6.5216, df = 14, p-value = 1.353e-05
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
5.234781 10.365219
sample estimates:
mean of x
7.8
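One caveat worth checking: with indices = TRUE (the second argument to boot.array above), the rows hold the row indices drawn for each replicate rather than the data values, so a test run directly on resample.1 describes the indices 1 to 15. A minimal sketch of mapping the indices back to the data, using simulated data and made-up names (dat, idx, resample_1_values):

```r
library(boot)

set.seed(1)
dat <- data.frame(xs = rnorm(15, 2))   # simulated stand-in data

bo <- boot(dat, statistic = function(d, i) mean(d[i, "xs"]), R = 100)

# R x n matrix: row r holds the indices used in bootstrap replicate r
idx <- boot.array(bo, indices = TRUE)

# Actual data values in the first resample
resample_1_values <- dat$xs[idx[1, ]]

# Their mean should match the first bootstrap replicate stored in bo$t
matches <- isTRUE(all.equal(mean(resample_1_values), bo$t[1]))
```

A t-test on resample_1_values, rather than on the index row, would then refer to the resampled densities themselves.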

alternative for wilcox.test in R

I'm trying a significance test using wilcox.test in R. I want to basically test if a value x is significantly within/outside a distribution d.
I'm doing the following:
d = c(90,99,60,80,80,90,90,54,65,100,90,90,90,90,90)
wilcox.test(60,d)
Wilcoxon rank sum test with continuity correction
data: 60 and d
W = 4.5, p-value = 0.5347
alternative hypothesis: true location shift is not equal to 0
Warning message:
In wilcox.test.default(60, d) : cannot compute exact p-value with ties
and basically the p-value is the same for a big range of numbers I test.
I've tried wilcox_test() from the coin package, but I can't get it to work testing a value against a distribution.
Is there an alternative to this test that does the same and knows how to deal with ties?
How worried are you about the non-exact results? I would guess that the approximation is reasonable for a data set this size. (I did manage to get coin::wilcox_test working, and the results are not hugely different ...)
d <- c(90,99,60,80,80,90,90,54,65,100,90,90,90,90,90)
pfun <- function(x) {
suppressWarnings(w <- wilcox.test(x,d)$p.value)
return(w)
}
testvec <- 30:120
p1 <- sapply(testvec,pfun)
library("coin")
pfun2 <- function(x) {
dd <- data.frame(y=c(x,d),f=factor(c(1,rep(2,length(d)))))
return(pvalue(wilcox_test(y~f,data=dd)))
}
p2 <- sapply(testvec,pfun2)
library("exactRankTests")
pfun3 <- function(x) {wilcox.exact(x,d)$p.value}
p3 <- sapply(testvec,pfun3)
Picture:
par(las=1,bty="l")
matplot(testvec,cbind(p1,p2,p3),type="s",
xlab="value",ylab="p value of wilcoxon test",lty=1,
ylim=c(0,1),col=c(1,2,4))
legend("topright",c("stats::wilcox.test","coin::wilcox_test",
"exactRankTests::wilcox.exact"),
lty=1,col=c(1,2,4))
(exactRankTests added by request, but given that it's not maintained any more and recommends the coin package, I'm not sure how reliable it is. You're on your own for figuring out what the differences among these procedures are and which would be best to use ...)
The results make sense here -- the problem is just that your power is low. If your value is completely outside the range of the data, for n=15, that will be a probability of something like 2*(1/16)=0.125 [i.e. probability of your sample ending up as the first or the last element in a permutation], which is not quite the same as the minimum value here (wilcox.test: p=0.105, wilcox_test: p=0.08), but that might be an approximation issue, or I might have some detail wrong. Nevertheless, it's in the right ballpark.
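The 2*(1/16) figure can be checked directly on a tie-free example, where wilcox.test computes an exact p-value. The data here are made up (d_noties is just 1:15, with a single value below all of them), so this illustrates the floor on the p-value rather than reproducing the numbers above.

```r
# One observation against 15 distinct values, with no ties
d_noties <- 1:15

# A value below the whole sample: the most extreme position possible
p_exact <- wilcox.test(0, d_noties)$p.value

# Two-sided exact p-value: 2 * (1/16), first or last of 16 ranks
p_exact
```

So with a single observation against n = 15 the two-sided exact p-value cannot go below 0.125, whatever the value is, which is why significance at 0.05 is out of reach.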
You can do this.
wilcox.test(60,d, exact=FALSE)
