Hypothesis testing for three groups

Based on the data, is the average sale amount statistically the same for the A, B, and C groups?
I ran t.test on the pairs AB, BC, and CA. For CA the p-value was > 0.05, so I concluded that for CA we can't reject the null hypothesis and the averages may be the same.
The alternative hypothesis (H1) was: the true difference in means between group 36-45 and group 46-50 is not equal to 0.
My question is: did I do this correctly, or is there another way to check the hypothesis for three groups?

If the population means of the groups are denoted mu_A, mu_B, and mu_C, then you are actually interested in the single joint null hypothesis H_0: mu_A = mu_B = mu_C. The problem with conducting three pairwise tests is that it is difficult to control the probability of a type I error. That is, how do you know that three tests, each at a significance level of 5%, will still reject the H_0 above with only 5% probability if this H_0 is true?
The test you are looking for is called an Analysis of Variance (ANOVA) test. It will provide a single test statistic and a single p-value to test the hypothesis above. If you search for "ANOVA statistical test", then Google will suggest many explanations (and probably also appropriate commands to do the analysis in R). I hope this helps.
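As a minimal sketch of what that looks like in R, here is a one-way ANOVA on simulated data. The data frame `sales` and its columns `group` and `sale` are made-up names for illustration, not the asker's actual data.

```r
# Simulated example data; 'sales', 'group' and 'sale' are hypothetical names
set.seed(1)
sales <- data.frame(
  group = factor(rep(c("A", "B", "C"), each = 30)),
  sale  = c(rnorm(30, mean = 100, sd = 15),
            rnorm(30, mean = 105, sd = 15),
            rnorm(30, mean = 100, sd = 15))
)

# One-way ANOVA: a single F test of H_0: mu_A = mu_B = mu_C
fit <- aov(sale ~ group, data = sales)
anova_table <- summary(fit)[[1]]
print(anova_table)

# p-value of the joint test (first row is the 'group' term)
p_value <- anova_table[["Pr(>F)"]][1]

# If the joint test rejects, pairwise comparisons with a
# family-wise correction can follow
tukey <- TukeyHSD(fit)
print(tukey)

# Welch-type alternative that does not assume equal variances
welch <- oneway.test(sale ~ group, data = sales)
print(welch)
```

If only the joint hypothesis matters, the single p-value from `aov` is enough; `TukeyHSD` is one common way to see which pairs differ while keeping the family-wise error rate under control.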

Related

Effect Size for Wilcox-Mann-Whitney in R Cohen's d

I have non-normally distributed data from two independent samples of patients, divided into two groups, 'control' and 'treatment'.
I would like to check whether there is a difference between the two groups and to measure that difference, so I am using this code:
wilcox.test(data.to.work$disease ~ data.to.work$group)
The test itself is fine. My question is: can I use Cohen's d to measure the effect size?
I also ran:
cohens_d(data.to.work$disease ~ data.to.work$group)
rcompanion::wilcoxonR(data.to.work$disease, g=data.to.work$group, ci=T)
Both give a large effect size.
May I use Cohen's d? Or is the second statistic the recommended one, or is there another?
Thanks

It would be unusual to pair Cohen's d with a Wilcoxon-Mann-Whitney test. There's no reason that you couldn't calculate it, but if you chose the WMW test, you probably wouldn't be that interested in comparing the difference in the sample means to the pooled standard deviation.
A typical standardized effect size statistic for the WMW test is based on the probability of an observation in one group being larger than an observation in the other group.
These standardized effect size statistics include Vargha and Delaney’s A, Cliff’s delta, the Glass rank biserial coefficient, and Grissom and Kim's Probability of Superiority.
Rather than using the wilcoxonR() function, I would recommend using a different function in that package that calculates one of the effect size statistics mentioned above.
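To make the "probability of one observation being larger than another" idea concrete, here is a rough base-R sketch that computes Vargha and Delaney's A and Cliff's delta by hand. The vectors `control` and `treatment` are invented numbers, not the asker's data, and this is only an illustration of the definitions rather than a replacement for the packaged functions.

```r
# Hypothetical example values for the two groups
control   <- c(12, 15, 11, 18, 14, 16, 13)
treatment <- c(17, 21, 15, 19, 24, 18, 20)

n1 <- length(treatment)
n2 <- length(control)

# Compare every treatment value with every control value
gt <- outer(treatment, control, ">")
eq <- outer(treatment, control, "==")

# Vargha and Delaney's A: P(treatment > control) + 0.5 * P(tie)
A <- mean(gt) + 0.5 * mean(eq)

# Cliff's delta: P(treatment > control) - P(treatment < control)
delta <- 2 * A - 1

# The same A can be recovered from the W statistic of wilcox.test
# (exact = FALSE avoids the warning about ties)
W <- wilcox.test(treatment, control, exact = FALSE)$statistic
A_from_W <- unname(W) / (n1 * n2)

c(A = A, delta = delta, A_from_W = A_from_W)
```

A value of A near 0.5 (delta near 0) means neither group tends to have larger values; values near 0 or 1 (delta near -1 or 1) mean almost complete separation.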

Calculate vaccine efficacy confidence Interval using the exact method

I'm trying to calculate confidence intervals for vaccine efficacy studies.
All the studies I am looking at claim that they use the exact method and cite this free PDF: Statistical Methods in Cancer Research Volume II: The Design and Analysis of Cohort Studies. It is my understanding that the exact method is also sometimes called the Clopper-Pearson method.
The data I have are: person-years of vaccinated, person-years of unvaccinated, number of cases among vaccinated, and number of cases among unvaccinated.
Efficacy is easy to calculate: (1 - ( (Number of cases among vaccinated/Person-years of vaccinated) / (Number of cases among unvaccinated/Person-years of unvaccinated) )) * 100
But calculating the confidence interval is harder.
At first I thought that this website gave the code I needed:
testall <- binom.test(8, 8+162)
(theta <- testall$conf.int)
(VE <- (1-2*theta)/(1-theta))
In this example, 8 is the number of cases in the vaccinated group and 162 is the number of cases in the unvaccinated group. But I have had a few problems with this.
(1) There are some studies where the sizes of the two cohorts (vaccinated vs. not vaccinated) are different. I don't think that this code works for those cohorts.
(2) I want to be able to adjust the type of confidence interval. For example, one study used "one-sided α risk of 2·5%" whereas another study used "a two-sided α level of 5%". I'm not clear whether this affects the numbers.
Either way, when I tried to run the numbers, it didn't work.
Here is an example of a data set I am trying to validate:
Number of cases among vaccinated: 176
Number of cases among unvaccinated: 221
Person-years of vaccinated: 11,793
Person-years of unvaccinated: 5,809
Efficacy: 60.8%
Two sided 95% CI: 52.0–68.0
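One possible way to approach this, sketched under the assumption that the studies use the usual conditional exact approach: given the total number of cases, the number of cases in the vaccinated group is treated as binomial, a Clopper-Pearson interval is computed for that proportion, and the limits are converted to a rate ratio using the person-years. The variable names are mine, and whether a particular study did exactly this is an assumption.

```r
# Data from the example above
cases_vax   <- 176
cases_unvax <- 221
py_vax      <- 11793
py_unvax    <- 5809

# Point estimate of efficacy in percent
ve_point <- (1 - (cases_vax / py_vax) / (cases_unvax / py_unvax)) * 100

# Conditional on the total number of cases, cases_vax ~ Binomial(total, theta)
# with theta = RR * py_vax / (RR * py_vax + py_unvax), RR = rate ratio.
# binom.test() returns the Clopper-Pearson (exact) interval for theta.
theta_ci <- as.numeric(
  binom.test(cases_vax, cases_vax + cases_unvax, conf.level = 0.95)$conf.int
)

# Invert the relation above: RR = theta / (1 - theta) * py_unvax / py_vax
rr_ci <- theta_ci / (1 - theta_ci) * (py_unvax / py_vax)

# Efficacy = 1 - RR; a large RR means low efficacy, so reverse the limits
ve_ci <- rev((1 - rr_ci) * 100)

round(c(estimate = ve_point, lower = ve_ci[1], upper = ve_ci[2]), 1)
```

With these inputs the result should land close to the published 60.8 (52.0 to 68.0). When the person-years are equal, the conversion reduces to (1-2*theta)/(1-theta), which is the formula in the snippet quoted above. The `conf.level` argument changes the coverage; as far as I can tell, a one-sided 2.5% limit coincides with the matching limit of the two-sided 95% interval, since each tail of the two-sided interval carries 2.5%.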

How to get tabulated interval of Wilcoxon-Mann-Whitney rank sum test

I was reading this topic on Rbloggers about the use of the Wilcoxon rank sum test: https://www.r-bloggers.com/wilcoxon-mann-whitney-rank-sum-test-or-test-u/
Especially this part, here I quote:
"We can finally compare the intervals tabulated on the tables of Wilcoxon for independent samples. The tabulated interval for two groups of 6 samples each is (26, 52)".
How can I get these "tabulated" values ?
I understand they used a table where the values are reported following the size of each samples, but I was wondering if there was a way to get them in R.
This matters because, as I understand the post, once you have a p-value > 0.05 and so cannot reject the null hypothesis H0, you can actually confirm H0 by comparing the "computed" and "tabulated" intervals.
So what I would need is the tabulated intervals, using R.

tl;dr
You can get confidence intervals for a Mann-Whitney-Wilcoxon test by specifying conf.int=TRUE.
Don't believe everything you read on the internet ...
If by "confirm" you mean "make sure that the computation is true", you don't need to double-check by consulting the original tables; the p-value should be enough to decide whether you can reject H0 or not. You can trust R for standard, widely used statistical methods. (I also show below how to repeat the computation with a different implementation from the coin package, which is a nearly independent check.)
If by "confirm" you mean "accept the null hypothesis", please don't do this; this is a fundamental violation of frequentist statistical theory, which says that you can reject a null hypothesis, but that you can never accept the null. Wide confidence intervals and p-values greater than a given threshold are evidence that the conclusion is uncertain (we can't be sure whether the null or the alternative is true), not that the null is true. The concluding text of the blog post referred to ("we conclude by accepting the hypothesis H0 of equality of means") is statistically incorrect.
A better way to interpret the uncertainty is to look at the confidence intervals. You can compute these for the Wilcoxon test: from ?wilcox.test:
... (if argument ‘conf.int’ is true [and a two-sample test is being performed]), a nonparametric
confidence interval and an estimator for ... the difference of the location parameters
‘x-y’ is computed.
> a = c(6, 8, 2, 4, 4, 5)
> b = c(7, 10, 4, 3, 5, 6)
> wilcox.test(b,a, conf.int=TRUE, correct=FALSE)
data: b and a
W = 22, p-value = 0.5174
alternative hypothesis: true location shift is not equal to 0
95 percent confidence interval:
-1.999975 4.000016
sample estimates:
difference in location
0.9999395
The high p-value (0.5174) says that we really can't tell whether the values in a and b differ systematically. The difference in location is an estimate of the shift between the two groups (according to ?wilcox.test, the median of the differences between a value from b and a value from a, which is neither a difference in ranks nor a difference in medians), and the confidence interval is the interval for this shift. In this case, for a total sample size of 12, the estimated shift is 1 (values in group b are slightly higher than in group a), and the confidence interval is (-2, 4) (the data are consistent with group b being somewhat lower or much higher than group a). It is admittedly rather difficult to interpret the substantive meaning of these values - that's one of the disadvantages of rank-based nonparametric tests ...
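For what it's worth, the "difference in location" that wilcox.test reports is described in ?wilcox.test as the median of the differences between a value from the first sample and a value from the second. A quick sketch that recomputes it, and the W statistic, by hand from the same two vectors:

```r
a <- c(6, 8, 2, 4, 4, 5)
b <- c(7, 10, 4, 3, 5, 6)

# All 36 pairwise differences b[i] - a[j]
pairwise_diff <- outer(b, a, "-")

# Median of the pairwise differences (Hodges-Lehmann type estimate)
hl_estimate <- median(pairwise_diff)
hl_estimate  # 1

# W counts the pairs with b > a, with ties counted as one half
W_by_hand <- sum(pairwise_diff > 0) + 0.5 * sum(pairwise_diff == 0)
W_by_hand  # 22
```

The value 0.9999395 in the output above is a numerical approximation of this 1; with ties in the data wilcox.test cannot use its exact routine and finds the estimate by root-finding.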
You can assume that the p-value computed by wilcox.test() is a reasonable summary of the evidence against the null hypothesis; there's no need to look up ranges in the tables. If you're worried about wilcox.test() in base R, you can try wilcox_test() from the coin package:
library(coin)
dd <- data.frame(f=factor(rep(c("a","b"),each=6)),x=c(a,b)) ## coin expects the grouping variable to be a factor
wilcox_test(x~f,data=dd,conf.int=TRUE) ## asymptotic test
which gives nearly identical results to wilcox.test(), and
wilcox_test(x~f,data=dd,conf.int=TRUE, distribution="exact")
which gives a slightly different p-value, but essentially the same confidence intervals.
of historical interest only
As for the tables: I found them on Google books, by doing a Google Scholar search with author:katti author:wilcox. There you can read the description of how they were computed; this wouldn't be impossible to replicate, but it seems unnecessary since p-values and confidence intervals are available via other methods. Digging through you find this:
The number 0.0206 in the red box indicates that the interval (26,52) corresponds to a one-tail p-value of 0.0206 (2-tailed = 0.0412); that's the closest you can get with a discrete range. The next closest range is given in the line below [(27,51), one-tailed p=0.0325, two-tailed=0.065]. In the 21st century you should never have to do this procedure.
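If you are still curious, the tabulated numbers can be recovered in R from the exact null distribution of the Mann-Whitney statistic via pwilcox() and qwilcox(). This is a sketch based on my reading of the table as intervals for the rank sum of one group; the only conversion needed is that the rank sum equals the Mann-Whitney count plus n1(n1+1)/2.

```r
n1 <- 6
n2 <- 6

# Rank sum of group 1 = Mann-Whitney U + n1 * (n1 + 1) / 2
offset <- n1 * (n1 + 1) / 2   # 21

# One-tailed probabilities for the two tabulated lower limits
p_26 <- pwilcox(26 - offset, n1, n2)   # P(rank sum <= 26) = 19/924
p_27 <- pwilcox(27 - offset, n1, n2)   # P(rank sum <= 27) = 30/924
round(c(p_26, p_27), 4)                # 0.0206 0.0325

# Largest U whose lower-tail probability stays below 0.025
# (qwilcox returns the smallest U with P(U <= u) >= 0.025, hence the - 1;
#  this shortcut would be off by one if some P(U <= u) were exactly 0.025)
u_crit <- qwilcox(0.025, n1, n2) - 1

# Convert back to the rank-sum scale; the upper limit follows by symmetry
lower <- u_crit + offset
upper <- n1 * (n1 + n2 + 1) - lower
c(lower, upper)                        # 26 52
```

Changing n1, n2, and the 0.025 reproduces other rows of such tables, which is really just a slower way of getting what the p-value already tells you.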

How to calculate p-values for each feature in R using two sample t-test

I have two data frames, cases and controls, and I performed a two-sample t-test as shown below. However, I am doing feature extraction from a feature set of 1299 features (columns), so I want to calculate a p-value for each feature. Based on the p-value generated for each feature, I want to decide whether or not to reject the null hypothesis.
Can anyone explain to me how the below output is interpreted and how to calculate the p-values for each feature?
t.test(New_data_zero,New_data_one)
Welch Two Sample t-test
data: New_data_zero_pca and New_data_one_pca
t = -29.086, df = 182840000, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.02499162 -0.02183612
sample estimates:
mean of x mean of y
0.04553462 0.06894849

Look at ?t.test. x and y are supposed to be vectors, not matrices, so the function is automatically converting them to vectors and pooling all features together. What you want to do, assuming that columns are features and the two matrices have the same features, is:
pvals <- numeric(ncol(New_data_zero))
for (i in seq_len(ncol(New_data_zero))) {
  # Welch two-sample t-test on feature (column) i
  pvals[i] <- t.test(New_data_zero[, i], New_data_one[, i])$p.value
}
Then you can look at pvals (probably on a log scale), ideally after a multiple hypothesis testing correction (see ?p.adjust).
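A self-contained sketch of the whole per-feature workflow, using small simulated matrices in place of the real data (the dimensions and the shift added to the first three columns are invented for the illustration):

```r
# Simulated stand-ins for the two groups: rows are samples, columns are features
set.seed(42)
n_features <- 20
New_data_zero <- matrix(rnorm(30 * n_features), ncol = n_features)
New_data_one  <- matrix(rnorm(25 * n_features), ncol = n_features)

# Give the first three features a real difference between the groups
New_data_one[, 1:3] <- New_data_one[, 1:3] + 1.5

# One Welch t-test per feature; vapply guarantees a numeric vector back
pvals <- vapply(
  seq_len(ncol(New_data_zero)),
  function(i) t.test(New_data_zero[, i], New_data_one[, i])$p.value,
  numeric(1)
)

# Benjamini-Hochberg correction for testing many features at once
padj <- p.adjust(pvals, method = "BH")

# Features that remain significant after correction
selected <- which(padj < 0.05)
selected
```

`method = "BH"` controls the false discovery rate; `"bonferroni"` or `"holm"` are stricter choices that control the family-wise error rate instead.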

Let's also address the enormously bad idea of this approach to finding differences among your features. Even if the true effects for all 1299 features are literally zero, you will find "significant" results in about 5% of the 1299 comparisons (roughly 65 features) at the 0.05 level, which makes this strategy effectively meaningless without a correction. I would strongly suggest taking a look at an introductory statistics text, especially the section on family-wise type I error rates, before proceeding.
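This is easy to check with a small simulation. The sketch below assumes 20 samples per group (an arbitrary choice of mine) and generates 1299 features with no true difference at all, then counts how many come out "significant":

```r
set.seed(2024)
n_features <- 1299

# Both groups drawn from the same distribution: every null hypothesis is true
cases    <- matrix(rnorm(20 * n_features), ncol = n_features)
controls <- matrix(rnorm(20 * n_features), ncol = n_features)

pvals <- vapply(
  seq_len(n_features),
  function(i) t.test(cases[, i], controls[, i])$p.value,
  numeric(1)
)

# Fraction and number of false positives at the unadjusted 0.05 level
false_pos_rate  <- mean(pvals < 0.05)
false_pos_count <- sum(pvals < 0.05)
c(rate = false_pos_rate, count = false_pos_count)

# The same count after a family-wise correction
holm_count <- sum(p.adjust(pvals, method = "holm") < 0.05)
holm_count
```

The unadjusted rate should hover around 0.05, which is dozens of spurious features, while the corrected count should be zero or very close to it.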

Perform a Shapiro-Wilk Normality Test

I want to perform a Shapiro-Wilk normality test. My data is in csv format. It looks like this:
heisenberg
HWWIchg
1 -15.60
2 -21.60
3 -19.50
4 -19.10
5 -20.90
6 -20.70
7 -19.30
8 -18.30
9 -15.10
However, when I perform the test, I get:
shapiro.test(heisenberg)
Error in [.data.frame(x, complete.cases(x)) :
undefined columns selected
Why isn't R selecting the right column, and how do I make it do that?

What does shapiro.test do?
shapiro.test tests the Null hypothesis that "the samples come from a Normal distribution" against the alternative hypothesis "the samples do not come from a Normal distribution".
How to perform shapiro.test in R?
The R help page for ?shapiro.test gives,
x - a numeric vector of data values. Missing values are allowed,
but the number of non-missing values must be between 3 and 5000.
That is, shapiro.test expects a numeric vector as input, that corresponds to the sample you would like to test and it is the only input required. Since you've a data.frame, you'll have to pass the desired column as input to the function as follows:
> shapiro.test(heisenberg$HWWIchg)
# Shapiro-Wilk normality test
# data: heisenberg$HWWIchg
# W = 0.9001, p-value = 0.2528
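Here is a self-contained version that rebuilds the data frame from the nine values shown in the question. The real file may well contain more rows, so the statistic from this snippet need not match the output above.

```r
# The nine values listed in the question
heisenberg <- data.frame(
  HWWIchg = c(-15.60, -21.60, -19.50, -19.10, -20.90,
              -20.70, -19.30, -18.30, -15.10)
)

# Passing the whole data frame fails, as in the question
# (the exact error message may differ between R versions)
bad <- try(shapiro.test(heisenberg), silent = TRUE)
inherits(bad, "try-error")

# Passing the column as a numeric vector works
res <- shapiro.test(heisenberg$HWWIchg)
res

# Equivalent ways to extract the column
res2 <- shapiro.test(heisenberg[["HWWIchg"]])
res3 <- shapiro.test(heisenberg[, "HWWIchg"])
```

After reading a csv with read.csv(), str(heisenberg) is a quick way to confirm the column name and that the column is numeric.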
Interpreting results from shapiro.test:
First, I strongly suggest you read this excellent answer from Ian Fellows on testing for normality.
As shown above, shapiro.test tests the NULL hypothesis that the samples came from a Normal distribution. This means that if your p-value <= 0.05, then you would reject the NULL hypothesis that the samples came from a Normal distribution. As Ian Fellows nicely put it, "you are testing against the assumption of Normality". In other words (correct me if I am wrong), it would be much better if one tested the NULL hypothesis that the samples do not come from a Normal distribution. Why? Because failing to reject a NULL hypothesis is not the same as accepting it.
In case of the null hypothesis of shapiro.test, a p-value <= 0.05 would reject the null hypothesis that the samples come from normal distribution. To put it loosely, there is a rare chance that the samples came from a normal distribution. The side-effect of this hypothesis testing is that this rare chance happens very rarely. To illustrate, take for example:
set.seed(450)
x <- runif(50, min=2, max=4)
shapiro.test(x)
# Shapiro-Wilk normality test
# data: runif(50, min = 2, max = 4)
# W = 0.9601, p-value = 0.08995
So, this (particular) sample runif(50, min=2, max=4) comes from a normal distribution according to this test. What I am trying to say is that there are many, many cases in which the "extreme" requirement (p < 0.05) is not satisfied, which leads to acceptance of the "NULL hypothesis" most of the time, and that might be misleading.
Another issue I'd like to quote here, from @PaulHiemstra in the comments, concerns the effect of large sample sizes:
An additional issue with the Shapiro-Wilk's test is that when you feed it more data, the chances of the null hypothesis being rejected becomes larger. So what happens is that for large amounts of data even very small deviations from normality can be detected, leading to rejection of the null hypothesis even though for practical purposes the data is more than normal enough.
Although he also points out that R's data size limit protects against this a bit:
Luckily shapiro.test protects the user from the above described effect by limiting the data size to 5000.
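The sample size effect is easy to see in a small simulation. This sketch draws data from a t distribution with 10 degrees of freedom, which I picked as a stand-in for "close to normal but not exactly", and compares how often shapiro.test rejects at n = 50 and at n = 5000:

```r
set.seed(123)

# Proportion of simulated samples of size n for which shapiro.test
# rejects normality at the 0.05 level
reject_rate <- function(n, reps = 100, df = 10) {
  mean(replicate(reps, shapiro.test(rt(n, df = df))$p.value < 0.05))
}

rate_small <- reject_rate(50)     # small samples
rate_large <- reject_rate(5000)   # the largest size shapiro.test accepts

c(n50 = rate_small, n5000 = rate_large)
```

The same mild deviation is flagged far more often in the large samples, even though the underlying distribution has not changed.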
If the NULL hypothesis were the opposite, meaning, the samples do not come from a normal distribution, and you get a p-value < 0.05, then you conclude that it is very rare that these samples do not come from a normal distribution (reject the NULL hypothesis). That loosely translates to: It is highly likely that the samples are normally distributed (although some statisticians may not like this way of interpreting). I believe this is what Ian Fellows also tried to explain in his post. Please correct me if I've gotten something wrong!
@PaulHiemstra also comments on practical situations (for example regression) in which one comes across this problem of testing for normality:
In practice, if an analysis assumes normality, e.g. lm, I would not do this Shapiro-Wilk's test, but do the analysis and look at diagnostic plots of the outcome of the analysis to judge whether any assumptions of the analysis were violated too much. For linear regression using lm this is done by looking at some of the diagnostic plots you get using plot(lm()). Statistics is not a series of steps that cough up a few numbers (hey p < 0.05!) but requires a lot of experience and skill in judging how to analyze your data correctly.
Here, I find the reply from Ian Fellows to Ben Bolker's comment under the same question already linked above equally (if not more) informative:
For linear regression,
Don't worry much about normality. The CLT takes over quickly and if you have all but the smallest sample sizes and an even remotely reasonable looking histogram you are fine.
Worry about unequal variances (heteroskedasticity). I worry about this to the point of (almost) using HCCM tests by default. A scale location plot will give some idea of whether this is broken, but not always. Also, there is no a priori reason to assume equal variances in most cases.
Outliers. A Cook's distance of > 1 is reasonable cause for concern.
Those are my thoughts (FWIW).
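A short sketch of the checks mentioned in the quotes above, using the built-in cars data set as a stand-in for a real regression (the choice of data set and model is mine):

```r
# Simple linear regression on a built-in data set
fit <- lm(dist ~ speed, data = cars)

# Standard diagnostic plots from plot(lm()):
#   1 = residuals vs fitted, 2 = normal Q-Q, 3 = scale-location
plot(fit, which = 1)
plot(fit, which = 2)
plot(fit, which = 3)

# Cook's distance for every observation; values above 1 deserve a closer look
cd <- cooks.distance(fit)
influential <- which(cd > 1)
influential

# Largest Cook's distance, for reference
max(cd)
```

The Q-Q plot replaces the formal normality test, the scale-location plot gives a rough view of unequal variances, and the Cook's distance check covers the outlier point.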
Hope this clears things up a bit.

You are applying shapiro.test() to a data.frame instead of the column. Try the following:
shapiro.test(heisenberg$HWWIchg)

You failed to specify the exact columns (data) to test for normality.
Use this instead
shapiro.test(heisenberg$HWWIchg)

Set the data as a vector and then place in the function.
