Performing one sample t test with given mean and sd - r

I'm trying to perform a one sample t test with given mean and sd.
Sample mean is 100.5
population mean is 100
population standard deviation is 2.19
and sample size is 50.
Although it is relatively simple to perform a t-test on a data set, I don't know how to perform one when only summary statistics are given.
What could be the easiest way to write this code?
I would like to get my t test value, df value and my p-value just like what the code t.test() gives you.
I saw another post similar to this, but it didn't have any solutions.
I couldn't find any explanation for performing one sample t test with given mean and sd.

Since the parameters of the population are known (mu=100, sigma=2.19) and your sample size is greater than 30 (n=50), it is possible to perform either a z-test or a t-test. However, base R doesn't have a function for the z-test. There is a z.test() function in the package BSDA (Arnholt and Evans, 2017):
library(BSDA)
z.test(
  x = sample_data,            # a vector of your sample values
  y = NULL,                   # since you are performing a one-sample test
  alternative = "two.sided",
  mu = 100,                   # specified in the null hypothesis
  sigma.x = 2.19
)
Similarly, the t-test can be performed using the base R function t.test():
t.test(
  sample_data,
  mu = 100,
  alternative = "two.sided"   # valid values are "two.sided", "less" and "greater"
)
The question is which test to use when interpreting the result.
The z-test is only an approximate test (i.e. the data will only be approximately Normal), unlike the t-test, so if there is a choice between a z-test and a t-test, it is recommended to choose the t-test (The R book by Michael J Crawley and colleagues).
Choosing between a one-sided and a two-sided test is also important. If the direction of the difference between μ and μ0 is not known in advance, a two-sided test should be used. Otherwise, a one-sided test is used.
Hope this helps.
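Since the question provides only summary statistics and no raw data vector, the statistic can also be computed by hand in base R. The following is a minimal sketch, not a definitive implementation: the helper name t_test_summary is hypothetical, and it assumes the quoted 2.19 is used as the sample standard deviation for the t test (and as the known sigma for the z comparison).

```r
# One-sample test from summary statistics, base R only.
# Assumption: s = 2.19 is treated as the sample sd (t test)
# and as the known population sd (z comparison).
t_test_summary <- function(xbar, s, n, mu0) {
  se     <- s / sqrt(n)               # standard error of the mean
  t_stat <- (xbar - mu0) / se         # test statistic
  df     <- n - 1                     # degrees of freedom for the t test
  c(t         = t_stat,
    df        = df,
    p.value.t = 2 * pt(-abs(t_stat), df),   # two-sided t p-value
    p.value.z = 2 * pnorm(-abs(t_stat)))    # two-sided z p-value
}

res <- t_test_summary(xbar = 100.5, s = 2.19, n = 50, mu0 = 100)
print(res)
```

With these inputs the statistic is about 1.614 on 49 degrees of freedom. The BSDA package also has tsum.test() and zsum.test(), which take summary statistics directly, if you prefer t.test()-style output.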

Related

How to calculate p-values for each feature in R using two sample t-test

I have two data frames, cases and controls, and I performed a two sample t-test as shown below. But I am doing feature extraction from a feature set of 1299 features (columns), so I want to calculate a p-value for each feature. Based on the p-value generated for each feature, I want to reject or fail to reject the null hypothesis.
Can anyone explain to me how the output below is interpreted and how to calculate the p-values for each feature?
t.test(New_data_zero,New_data_one)
Welch Two Sample t-test
data: New_data_zero_pca and New_data_one_pca
t = -29.086, df = 182840000, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.02499162 -0.02183612
sample estimates:
mean of x mean of y
0.04553462 0.06894849
Look at ?t.test. x and y are supposed to be vectors, not matrices, so the function is automatically converting them to vectors and pooling all features into one test. What you want to do, assuming that columns are features and the two matrices have the same features, is:
pvals <- numeric(ncol(New_data_zero))   # one p-value per feature
for (i in seq_len(ncol(New_data_zero))) {
  pvals[i] <- t.test(New_data_zero[, i], New_data_one[, i])$p.value
}
Then you can look at pvals (probably on a log scale), ideally after applying a multiple hypothesis testing correction (see ?p.adjust).
Let's also address why this approach to finding differences among your features is an enormously bad idea. Even if all of the true effects for these 1299 features are literally zero, you should expect "significant" results (p < 0.05) in about 5% of the 1299 comparisons, roughly 65 features, which makes the uncorrected strategy effectively meaningless. I would strongly suggest taking a look at an introductory statistics text, especially the section on family-wise type I error rates, before proceeding.
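To make the multiple-testing point concrete, here is a small simulation sketch. Everything in it is hypothetical: the matrices cases and controls are simulated pure noise (40 rows each is an arbitrary choice), so every null hypothesis is true by construction.

```r
set.seed(1)
n_features <- 1299

# Simulated stand-ins for the two data sets: no real group difference anywhere.
cases    <- matrix(rnorm(40 * n_features), ncol = n_features)
controls <- matrix(rnorm(40 * n_features), ncol = n_features)

# One Welch t test per feature (column).
pvals <- vapply(seq_len(n_features), function(i) {
  t.test(cases[, i], controls[, i])$p.value
}, numeric(1))

n_raw <- sum(pvals < 0.05)            # uncorrected "discoveries", all false here

# Corrections from base R: Holm controls the family-wise error rate,
# Benjamini-Hochberg controls the false discovery rate.
p_holm <- p.adjust(pvals, method = "holm")
p_bh   <- p.adjust(pvals, method = "BH")

c(raw = n_raw, holm = sum(p_holm < 0.05), BH = sum(p_bh < 0.05))
```

The uncorrected count should land somewhere near 5% of 1299, while the adjusted counts should be at or close to zero. Which correction is appropriate depends on whether you want to control the family-wise error rate or the false discovery rate.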

Critical Value for Shapiro Wilk test

I am trying to get the critical W value for a Shapiro Wilk Test in R.
Shapiro-Wilk normality test
data: samplematrix[, 1]
W = 0.69661, p-value = 7.198e-09
With n=50 and alpha=.05, I know that the critical value is W=.947, from consulting the critical value table. However, how do I get this critical value using R?
Computing critical values directly is not easy (see this CrossValidated answer); what I've got here is essentially the same as what's in that answer (although I came up with it independently, and it improves on that answer slightly by using order statistics rather than random samples). The idea is that we can make a sample progressively more non-Normal until it gets exactly the desired p-value (0.05 in this case), then see what W-statistic corresponds to that sample.
## compute S-W for a given Gamma shape parameter and sample size
tmpf <- function(gshape=20,n=50) {
shapiro.test(qgamma((1:n)/(n+1),scale=1,shape=gshape))
}
## find shape parameter that corresponds to a particular p-value
find.shape <- function(n,alpha) {
uniroot(function(x) tmpf(x,n)$p.value-alpha,
interval=c(0.01,100))$root
}
find.W <- function(n,alpha) {
s <- find.shape(n,alpha)
tmpf(s,n=n)$statistic
}
find.W(50,0.05)
The answer (0.9540175) is not quite the same as the answer you got, because R uses an approximation to the Shapiro-Wilk test. As far as I know, the actual S-W critical value tables stem entirely from Shapiro and Wilk 1965 Biometrika http://www.jstor.org/stable/2333709 p. 605, which says only "Based on fitted Johnson (1949) S_B approximation, see Shapiro and Wilk 1965a for details" - and "Shapiro and Wilk 1965a" refers to an unpublished manuscript! (S&W essentially sampled many Normal deviates, computed the SW statistic, constructed smooth approximations of the SW statistic over a range of values, and took the critical values from this distribution).
I also tried to do this by brute force, but (see below) if we want to be naive and not do curve-fitting as SW did, we will need much larger samples ...
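find.W.stoch <- function(n=50,alpha=0.05,N=200000,.progress="none") {
## each call returns c(W, p-value); raply stacks them into an N x 2 matrix
d <- plyr::raply(N,.Call(stats:::C_SWilk,sort(rnorm(n))),
.progress=.progress)
## the test rejects for small W, so the critical value is the lower alpha quantile of W
return(quantile(d[,1],alpha))
}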
Compare original S&W values (transcribed from the papers) with the R approximation:
SW1965 <- c(0.767,0.748,0.762,0.788,0.803,0.818,0.829,0.842,
0.850,0.859,0.866,0.874,0.881,0.887,0.892,0.897,0.901,0.905,
0.908,0.911,0.914,0.916,0.918,0.920,0.923,0.924,0.926,0.927,
0.929,0.930,0.931,0.933,0.934,0.935,0.936,0.938,0.939,0.940,
0.941,0.942,0.943,0.944,0.945,0.945,0.946,0.947,0.947,0.947)
Rapprox <- sapply(3:50,find.W,alpha=0.05)
Rapprox.stoch <- sapply(3:50,find.W.stoch,alpha=0.05,.progress="text")
par(bty="l",las=1)
matplot(3:50,cbind(SW1965,Rapprox,Rapprox.stoch),col=c(1,2,4),
type="l",
xlab="n",ylab=~W[crit])
legend("bottomright",col=c(1,2,4),lty=1:3,
c("SW orig","R approx","stoch"))

ANOVA (AOV function) in R: Misleading p_value reported on equal values

I would greatly appreciate any guidance on the following: I am running ANOVA (aov) to retrieve p-values for a number of subsets of a larger data set. I bumped into a subset where all of my numeric values are equal to 36. Because it is part of a loop, ANOVA is still executed and reports a seemingly infinitely small p_value of 1.2855e-134. Correct me if I am wrong, but doesn't a smaller p_value mean stronger evidence that the difference between the factor levels is significant?
For simplicity this is the subset:
[image: sUBSET_FOR_ANOVA]
Here is how I calculate ANOVA and retrieve p_value, where TEMP_DF2 is just the subset you see attached:
#
anova_sweep <- aov(GOOD_PTS ~ MACH, data = TEMP_DF2)
p_value <- summary(anova_sweep)[[1]][["Pr(>F)"]]
p_value <- p_value[1]   # first entry is the p-value for MACH
#
Many thanks for any guidance,
I can't replicate your findings. Let's produce an example dataset with all values being 36:
df <- data.frame(gr = rep(letters[1:2], 100),
y = 36)
summary(aov(y~gr, data = df))
Gives:
             Df    Sum Sq   Mean Sq F value Pr(>F)
gr            1 1.260e-27 1.262e-27       1  0.319
Residuals   198 2.499e-25 1.262e-27
Basically, depending on the sample size, we obtain a p-value around 0.3 or so. The F statistic is (by definition) always 1, since the between and within group variances are equal.
Are these results misleading? To some extent, yes. The estimated SS and MS should be exactly 0, but aov calculates them as very, very small numbers because of floating-point error. Some other statistical tests in R and in some packages check for zero variance and would produce an error, but aov apparently does not.
However, more importantly, I would say your data violate the assumptions of ANOVA, and therefore no result from it can be trusted as a basis for conclusions. The expectation in R when it comes to statistical tests is usually that it is up to the user to employ the tests in the correct circumstances.
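Because aov does not check for zero variance itself, one option inside a loop is to test for a constant response before fitting. The sketch below is only an illustration: aov_p_value is a hypothetical helper, and the column names y and gr come from the example data frame above, not from the original TEMP_DF2.

```r
# Hypothetical helper: return the ANOVA p-value for response ~ group,
# or NA when the response has no variation and the test is meaningless.
aov_p_value <- function(data, response, group) {
  y <- data[[response]]
  if (length(unique(y[!is.na(y)])) < 2) {
    return(NA_real_)                       # constant response: skip the test
  }
  fit <- aov(reformulate(group, response = response), data = data)
  summary(fit)[[1]][["Pr(>F)"]][1]         # p-value for the grouping factor
}

# All values equal to 36: the guard returns NA instead of a spurious p-value.
constant_df <- data.frame(gr = rep(letters[1:2], 100), y = 36)
p_constant <- aov_p_value(constant_df, "y", "gr")

# A response with real variation is passed through to aov as usual.
set.seed(42)
varying_df <- data.frame(gr = rep(letters[1:2], 100), y = rnorm(200, mean = 36))
p_varying <- aov_p_value(varying_df, "y", "gr")

c(constant = p_constant, varying = p_varying)
```

The exact-equality check only catches a strictly constant response. Values that differ by tiny rounding amounts would need a tolerance-based check instead.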

Perform a Shapiro-Wilk Normality Test

I want to perform a Shapiro-Wilk Normality Test. My data is in csv format. It looks like this:
heisenberg
HWWIchg
1 -15.60
2 -21.60
3 -19.50
4 -19.10
5 -20.90
6 -20.70
7 -19.30
8 -18.30
9 -15.10
However, when I perform the test, I get:
shapiro.test(heisenberg)
Error in `[.data.frame`(x, complete.cases(x)) :
undefined columns selected
Why isn't R selecting the right column, and how do I do that?
What does shapiro.test do?
shapiro.test tests the Null hypothesis that "the samples come from a Normal distribution" against the alternative hypothesis "the samples do not come from a Normal distribution".
How to perform shapiro.test in R?
The R help page for ?shapiro.test gives,
x - a numeric vector of data values. Missing values are allowed,
but the number of non-missing values must be between 3 and 5000.
That is, shapiro.test expects a numeric vector as input, that corresponds to the sample you would like to test and it is the only input required. Since you've a data.frame, you'll have to pass the desired column as input to the function as follows:
> shapiro.test(heisenberg$HWWIchg)
# Shapiro-Wilk normality test
# data: heisenberg$HWWIchg
# W = 0.9001, p-value = 0.2528
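For a reproducible version of the same fix, the sketch below rebuilds the data frame from the nine rows shown in the question (the original CSV may contain more) and contrasts the failing call with the working one. With a real file you would start from something like read.csv("your_file.csv"), where the file name is a placeholder.

```r
# Data frame rebuilt from the nine rows shown in the question.
heisenberg <- data.frame(
  HWWIchg = c(-15.60, -21.60, -19.50, -19.10, -20.90,
              -20.70, -19.30, -18.30, -15.10)
)

# Passing the whole data frame fails, because shapiro.test wants a numeric vector.
res_df <- try(shapiro.test(heisenberg), silent = TRUE)
inherits(res_df, "try-error")

# Extracting the column gives a numeric vector, which works.
is.numeric(heisenberg$HWWIchg)
res <- shapiro.test(heisenberg$HWWIchg)
res$statistic
res$p.value
```

heisenberg[["HWWIchg"]] or heisenberg[, "HWWIchg"] would work equally well. The point is only that a numeric vector reaches the function.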
Interpreting results from shapiro.test:
First, I strongly suggest you read this excellent answer from Ian Fellows on testing for normality.
As shown above, the shapiro.test tests the NULL hypothesis that the samples came from a Normal distribution. This means that if your p-value <= 0.05, then you would reject the NULL hypothesis that the samples came from a Normal distribution. As Ian Fellows nicely put it, you are testing "against the assumption of Normality". In other words (correct me if I am wrong), it would be much better if one tests the NULL hypothesis that the samples do not come from a Normal distribution. Why? Because rejecting a NULL hypothesis is not the same as accepting the alternative hypothesis.
In case of the null hypothesis of shapiro.test, a p-value <= 0.05 would reject the null hypothesis that the samples come from normal distribution. To put it loosely, there is a rare chance that the samples came from a normal distribution. The side-effect of this hypothesis testing is that this rare chance happens very rarely. To illustrate, take for example:
set.seed(450)
x <- runif(50, min=2, max=4)
shapiro.test(x)
# Shapiro-Wilk normality test
# data: runif(50, min = 2, max = 4)
# W = 0.9601, p-value = 0.08995
So, according to this test, normality is not rejected for this (particular) sample from runif(50, min=2, max=4), even though it was drawn from a uniform distribution. What I am trying to say is that there are many cases in which the "extreme" requirement (p < 0.05) is not satisfied, which leads to "acceptance" of the NULL hypothesis most of the time, and that might be misleading.
Another issue I'd like to quote here, from @PaulHiemstra in the comments, concerns the effect of large sample sizes:
An additional issue with the Shapiro-Wilk's test is that when you feed it more data, the chances of the null hypothesis being rejected becomes larger. So what happens is that for large amounts of data even very small deviations from normality can be detected, leading to rejection of the null hypothesis even though for practical purposes the data is more than normal enough.
Although he also points out that R's data size limit protects against this a bit:
Luckily shapiro.test protects the user from the above described effect by limiting the data size to 5000.
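The sample-size effect described in that comment can be illustrated with a rough simulation. This is a sketch under assumptions of my own choosing: the "mildly non-normal" data are draws from a t distribution with 10 degrees of freedom, and rejection_rate is a hypothetical helper.

```r
set.seed(123)

# Hypothetical helper: fraction of simulated samples of size n for which
# shapiro.test rejects normality at level alpha.
# Assumption: t-distributed data with df = 10 stand in for "almost normal" data.
rejection_rate <- function(n, reps = 100, df = 10, alpha = 0.05) {
  p <- replicate(reps, shapiro.test(rt(n, df = df))$p.value)
  mean(p < alpha)
}

rate_small <- rejection_rate(50)     # small samples: the deviation is rarely detected
rate_large <- rejection_rate(5000)   # largest size shapiro.test accepts

c(n50 = rate_small, n5000 = rate_large)
```

The same underlying distribution is rejected far more often at n = 5000 than at n = 50. That is exactly why a significant result on a large sample says little about whether the deviation matters in practice.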
If the NULL hypothesis were the opposite, meaning, the samples do not come from a normal distribution, and you get a p-value < 0.05, then you conclude that it is very rare that these samples do not come from a normal distribution (reject the NULL hypothesis). That loosely translates to: It is highly likely that the samples are normally distributed (although some statisticians may not like this way of interpreting). I believe this is what Ian Fellows also tried to explain in his post. Please correct me if I've gotten something wrong!
@PaulHiemstra also comments on practical situations (for example, regression) in which one comes across this problem of testing for normality:
In practice, if an analysis assumes normality, e.g. lm, I would not do this Shapiro-Wilk's test, but do the analysis and look at diagnostic plots of the outcome of the analysis to judge whether any assumptions of the analysis were violated too much. For linear regression using lm this is done by looking at some of the diagnostic plots you get using plot(lm()). Statistics is not a series of steps that cough up a few numbers (hey p < 0.05!) but requires a lot of experience and skill in judging how to analyze your data correctly.
Here, I find the reply from Ian Fellows to Ben Bolker's comment under the same question already linked above equally (if not more) informative:
For linear regression,
Don't worry much about normality. The CLT takes over quickly and if you have all but the smallest sample sizes and an even remotely reasonable looking histogram you are fine.
Worry about unequal variances (heteroskedasticity). I worry about this to the point of (almost) using HCCM tests by default. A scale location plot will give some idea of whether this is broken, but not always. Also, there is no a priori reason to assume equal variances in most cases.
Outliers. A Cook's distance of > 1 is reasonable cause for concern.
Those are my thoughts (FWIW).
Hope this clears things up a bit.
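The diagnostic checks mentioned in the quoted comments could look roughly like the sketch below. It uses the built-in cars data set purely as a stand-in for your own regression. The model and variable names are not from the original question.

```r
# Stand-in model: stopping distance as a function of speed (built-in data).
fit <- lm(dist ~ speed, data = cars)

# The four default diagnostic plots: residuals vs fitted, normal Q-Q,
# scale-location (for unequal variances), and residuals vs leverage.
op <- par(mfrow = c(2, 2))
plot(fit)
par(op)

# Cook's distance for every observation; flag any above the rule-of-thumb cutoff of 1.
cd <- cooks.distance(fit)
influential <- which(cd > 1)
influential
```

The scale-location panel is the one to look at for heteroskedasticity, and the Q-Q panel replaces a formal normality test on the residuals.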
You are applying shapiro.test() to a data.frame instead of the column. Try the following:
shapiro.test(heisenberg$HWWIchg)
You failed to specify the exact columns (data) to test for normality.
Use this instead
shapiro.test(heisenberg$HWWIchg)
Set the data as a vector and then place in the function.

how to make a T-Test in R in GDS files?

My project is about predicting biomarker breast cancer.
I use this function to give me a 2x2 matrix:
Table(gpl96)[1:10,1:4]
I want to take this data that represents the samples of genes in GDS and compare the p-value to know if it is normally distributed or not.
t.test tests whether there is a difference in location between two samples that adhere to normal distributions.
To approximately check the assumed normality, you might inspect whether outputs from qqnorm seem linear, or use ks.test in conjunction with estimating parameters from observations*:
set.seed(1)
x1 <- rnorm(200,40,10) # should follow a normal distribution
ks.test(x1,"pnorm",mean=mean(x1),sd=sd(x1)) # p: 0.647 [qqnorm(x1) looks linear]
x2 <- rexp(200,10) # should *not* follow a normal distribution
ks.test(x2,"pnorm",mean=mean(x2),sd=sd(x2)) # p: 3.576e-05, [qqnorm(x2) seems curved]
I do not know GEO's Table, but I suggest you might want to use its VALUE columns -and not any 2x2 matrices- as inputs for t.test, qqnorm or ks.test; maybe you might provide some additional illustration of your data by posting outputs of head(Table(gpl96)[1:10,1:4]).
(* After https://stat.ethz.ch/pipermail/r-help/2003-October/040692.html, which also appears to demonstrate the more refined Lilliefors test.)
