R `t.test` problem cause `159 equal to 8875`

I'm using R version 3.5.3 (2019-03-11). Have a look at the output below:
> t.test(a$score,a$time,paired=FALSE)
Welch Two Sample t-test
data: a$score and a$time
t = -1.4861, df = 8382, p-value = 0.1373
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-20215.279 2781.535
sample estimates:
mean of x mean of y
159.0481 8875.9203
The p-value is 0.1373 (> 0.05), but the means of the two variables are 159.0481 and 8875.9203.
I uploaded the .Rdata file to https://file.io/EH9XV44u
Anything wrong with my t.test?

I think the title of your question shows the problem here.
Finding that a mean of 159 in one set of data is not significantly different from a mean of 8875 in another set of data doesn't mean '159 equal to 8875'.
It just means that the data allow sufficient uncertainty about the 'true' values of the means (from which the data were picked) that you can't say with confidence that they are different.
Even though - intuitively - you might think that 159 'looks' very different to 8875, we perform a statistical test to verify (or refute) our intuition that this difference couldn't have arisen by chance. In this case, it seems that intuition is wrong.
As Edward & Hong Oui have said in the comments, this is probably because one (or both) of your datasets is very dispersed, so the mean alone doesn't reflect the amount of uncertainty.
An extreme example, which might make this clearer:
data1: c(7,105,365) = mean 159
data2: c(3,22,26600) = mean 8875
It's clear (to me) that we can't be very confident that data1 and data2 are really different, since the difference in the mean value arises from just a single high value in data2. So, although the means seem very different, we don't expect that this will be significant if we test it.
Indeed:
t.test(data1,data2)
# p-value = 0.4291
I guess that if you look closely at your own data you'll find something similar...
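To make the role of dispersion concrete, here is a minimal sketch using the toy vectors from the example above (not the asker's real data). It only assumes base R and compares the spread of each vector with the difference in means.

```r
# Toy vectors from the example above (not the asker's real data)
data1 <- c(7, 105, 365)
data2 <- c(3, 22, 26600)

# Difference in means versus the spread of each sample
mean_diff <- mean(data2) - mean(data1)
sd1 <- sd(data1)
sd2 <- sd(data2)   # far larger than the difference in means

# Standard error of the difference, as used by the Welch t-test
se_diff <- sqrt(var(data1) / length(data1) + var(data2) / length(data2))

res <- t.test(data1, data2)
print(c(mean_diff = mean_diff, sd1 = sd1, sd2 = sd2, se_diff = se_diff))
print(res$p.value)
```

The standard error of the difference is about as large as the difference itself, which is why the test cannot separate the two means.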

How can I create a normal distributed set of data in R?

I'm a newbie in statistics and I'm studying R.
I decided to do this exercise to practice some analysis with an original dataset.
This is the issue: I want to create a dataset of, let's say, 100 subjects, each with a test score.
The test score ranges from 0 to 70 and the mean score is 48 (and it's improbable that someone scores 0).
First I tried to create the set with x <- round(runif(100, min=0, max=70)), but then I found out with plot(x) that the values were not normally distributed.
So I searched for another R command and found this, but I couldn't work out how to set the min/max:
ex1 <- round(rnorm(100, mean=48 , sd=5))
I really can't understand what I have to do!
I would like to write a function that gives me a normally distributed set of data, in a range of 0-70, with a mean of 48 and a fairly small standard deviation, so that I can do some t-tests later...
Any help?
Thanks a lot in advance guys
The normal distribution, by definition, does not have a min or max. If you go more than a few standard deviations from the mean, the probability density is very small, but not 0. You can truncate a normal distribution, chopping off the tails. Here, I use pmin and pmax to set any values below 0 to 0, and any values above 70 to 70:
ex1 <- round(rnorm(100, mean=48 , sd=5))
ex1 <- pmin(ex1, 70)
ex1 <- pmax(ex1, 0)
You can calculate the probability of an individual observation being below or above a certain point using pnorm. For your mean of 48 and SD of 5, the probability an individual observation is less than 0 is very small:
pnorm(0, mean = 48, sd = 5)
# [1] 3.997221e-22
This probability is so small that the truncation step is unnecessary in most applications. But if you started experimenting with bigger standard deviations, or mean values closer to the bounds, it could become necessary.
This method of truncation is simple, but it is a bit of a hack. If you truncated a distribution to be within 1 SD of the mean using this method, you would end up with spikes at the upper and lower bounds that are even higher than the density at the mean! But it should work well enough for less extreme applications. A more robust method might be to draw more samples than you need, and keep the first n samples that fall within your bounds. If you really care to do things right, there are packages that implement truncated normal distributions.
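(Your upper bound of 70 is closer to the mean than 0 is, so the probability of an observation above 70 is larger than the probability of one below 0. At 4.4 SDs from the mean it is still tiny, roughly 5 in a million; you can check it with pnorm(70, mean = 48, sd = 5, lower.tail = FALSE).)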
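The "draw more than you need and keep what falls inside the bounds" idea mentioned above can be sketched as follows. The helper name rnorm_bounded is made up for this illustration; it is not a base R or package function, and it is a simple rejection sampler rather than a tuned implementation.

```r
# Hypothetical helper (not part of base R): rejection sampling from a
# normal distribution restricted to [lower, upper]
rnorm_bounded <- function(n, mean, sd, lower, upper) {
  out <- numeric(0)
  while (length(out) < n) {
    draw <- rnorm(2 * n, mean = mean, sd = sd)          # oversample
    out <- c(out, draw[draw >= lower & draw <= upper])  # keep in-bounds draws
  }
  out[seq_len(n)]                                       # first n accepted draws
}

set.seed(1)
scores <- round(rnorm_bounded(100, mean = 48, sd = 5, lower = 0, upper = 70))
range(scores)

# Tighter bounds (within 1 SD) work too, without spikes at the edges
tight <- rnorm_bounded(50, mean = 48, sd = 5, lower = 43, upper = 53)
```

Unlike clamping with pmin and pmax, rejected draws are discarded rather than moved onto the bounds, so no probability mass piles up at 0 or 70.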

How to get tabulated interval of Wilcoxon-Mann-Whitney rank sum test

I was reading this topic on Rbloggers about the use of the Wilcoxon rank sum test: https://www.r-bloggers.com/wilcoxon-mann-whitney-rank-sum-test-or-test-u/
Especially this part, here I quote:
"We can finally compare the intervals tabulated on the tables of Wilcoxon for independent samples. The tabulated interval for two groups of 6 samples each is (26, 52)".
How can I get these "tabulated" values ?
I understand they used a table where the values are reported following the size of each samples, but I was wondering if there was a way to get them in R.
It is important because, as I understand the post, once you have a p-value > 0.05 and so cannot reject the null hypothesis H0, you can actually confirm H0 by comparing the "computed" and "tabulated" intervals.
So what I would need is the tabulated intervals, using R.
tl;dr
You can get confidence intervals for a Mann-Whitney-Wilcoxon test by specifying conf.int=TRUE.
Don't believe everything you read on the internet ...
If by "confirm" you mean "make sure that the computation is true", you don't need to double-check by consulting the original tables; the p-value should be enough to decide whether you can reject H0 or not. You can trust R for standard, widely used statistical methods. (I also show below how to repeat the computation with a different implementation from the coin package, which is a nearly independent check.)
If by "confirm" you mean "accept the null hypothesis", please don't do this; this is a fundamental violation of frequentist statistical theory, which says that you can reject a null hypothesis, but that you can never accept the null. Wide confidence intervals and p-values greater than a given threshold are evidence that the conclusion is uncertain (we can't be sure whether the null or the alternative is true), not that the null is true. The concluding text of the blog post referred to ("we conclude by accepting the hypothesis H0 of equality of means") is statistically incorrect.
A better way to interpret the uncertainty is to look at the confidence intervals. You can compute these for the Wilcoxon test: from ?wilcox.test:
... (if argument ‘conf.int’ is true [and a two-sample test is being performed]), a nonparametric
confidence interval and an estimator for ... the difference of the location parameters
‘x-y’ is computed.
> a = c(6, 8, 2, 4, 4, 5)
> b = c(7, 10, 4, 3, 5, 6)
> wilcox.test(b,a, conf.int=TRUE, correct=FALSE)
data: b and a
W = 22, p-value = 0.5174
alternative hypothesis: true location shift is not equal to 0
95 percent confidence interval:
-1.999975 4.000016
sample estimates:
difference in location
0.9999395
The high p-value (0.5174) says that we really can't tell whether the values in a or b have significantly different ranks. The difference in location is an estimate of the shift between the two groups (per ?wilcox.test, it is the median of the differences between a value from b and a value from a, not the difference in medians), and the confidence interval gives the uncertainty on this shift. In this case, for a total sample size of 12, the estimated shift is 1 (values in group b tend to be slightly higher than those in group a), and the confidence interval is (-2, 4) (the data are consistent with group b being slightly lower or much higher than group a). It is admittedly rather difficult to interpret the substantive meaning of these values - that's one of the disadvantages of rank-based nonparametric tests ...
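As a rough check on where W and the location estimate come from, the following sketch recomputes both by hand from the vectors a and b used above. It only uses base R; the by-hand formulas are the standard definitions, not the internals of wilcox.test.

```r
a <- c(6, 8, 2, 4, 4, 5)
b <- c(7, 10, 4, 3, 5, 6)

# W counts the (b, a) pairs where b is larger, with ties counted as one half
W_by_hand <- sum(outer(b, a, ">")) + 0.5 * sum(outer(b, a, "=="))

# Point estimate of the shift: median of all 36 pairwise differences b - a
shift_by_hand <- median(outer(b, a, "-"))

# exact = FALSE avoids the warning about ties; W itself is unaffected
W_from_test <- unname(wilcox.test(b, a, exact = FALSE, correct = FALSE)$statistic)

print(c(W_by_hand = W_by_hand, W_from_test = W_from_test, shift = shift_by_hand))
```

The by-hand shift is exactly 1; wilcox.test reports 0.9999395 because, with ties present, it finds the estimate numerically.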
You can assume that the p-value computed by wilcox.test() is a reasonable summary of the evidence against the null hypothesis; there's no need to look up ranges in the tables. If you're worried about wilcox.test() in base R, you can try wilcox_test() from the coin package:
library(coin)
dd <- data.frame(f=factor(rep(c("a","b"),each=6)),x=c(a,b)) ## wilcox_test() needs a factor
wilcox_test(x~f,data=dd,conf.int=TRUE) ## asymptotic test
which gives nearly identical results to wilcox.test(), and
wilcox_test(x~f,data=dd,conf.int=TRUE, distribution="exact")
which gives a slightly different p-value, but essentially the same confidence intervals.
of historical interest only
As for the tables: I found them on Google books, by doing a Google Scholar search with author:katti author:wilcox. There you can read the description of how they were computed; this wouldn't be impossible to replicate, but it seems unnecessary since p-values and confidence intervals are available via other methods. Digging through you find this:
The number 0.0206 in the red box indicates that the interval (26,52) corresponds to a one-tail p-value of 0.0206 (2-tailed = 0.0412); that's the closest you can get with a discrete range. The next closest range is given in the line below [(27,51), one-tailed p=0.0325, two-tailed=0.065]. In the 21st century you should never have to do this procedure.

How to calculate p-values for each feature in R using two sample t-test

I have two data frames, cases and controls, and I performed a two-sample t-test as shown below. But I am doing feature extraction from a feature set of 1299 features (columns), so I want to calculate p-values for each feature. Based on the p-value generated for each feature, I want to reject or accept the null hypothesis.
Can anyone explain to me how the below output is interpreted and how to calculate the p-values for each feature?
t.test(New_data_zero,New_data_one)
Welch Two Sample t-test
data: New_data_zero_pca and New_data_one_pca
t = -29.086, df = 182840000, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.02499162 -0.02183612
sample estimates:
mean of x mean of y
0.04553462 0.06894849
Look at ?t.test. x and y are supposed to be vectors, not matrices, so the function is automatically converting them to vectors. What you want to do, assuming that columns are features and the two matrices have the same features, is:
# one Welch t-test per column (feature)
pvals <- numeric(ncol(New_data_zero))
for (i in seq_len(ncol(New_data_zero))) {
  pvals[i] <- t.test(New_data_zero[, i], New_data_one[, i])$p.value
}
Then you can look at pvals (probably on a log scale), after applying a multiple hypothesis testing correction (see ?p.adjust).
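A minimal sketch of the per-feature tests followed by a correction, on simulated stand-ins for the two data sets. The matrices zero and one, their sizes, and the choice of the Benjamini-Hochberg method are assumptions made for illustration; ?p.adjust lists the other methods.

```r
set.seed(42)
# Hypothetical stand-ins for New_data_zero / New_data_one: 30 rows, 5 features
zero <- matrix(rnorm(30 * 5), ncol = 5)
one  <- matrix(rnorm(30 * 5, mean = 0.5), ncol = 5)

# One Welch t-test per column
pvals <- sapply(seq_len(ncol(zero)),
                function(i) t.test(zero[, i], one[, i])$p.value)

# Benjamini-Hochberg adjustment (controls the false discovery rate)
p_bh <- p.adjust(pvals, method = "BH")

# Features still below 0.05 after adjustment
which(p_bh < 0.05)
```

Adjusted p-values are never smaller than the raw ones, so fewer features pass the threshold than with the uncorrected loop.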
Let's also address the enormously bad idea of this approach to finding differences among your features. Even if the true differences for all 1299 features are literally zero, you should expect about 5% of the 1299 tests (roughly 65) to come out "significant" at the 0.05 level, which makes this strategy effectively meaningless. I would strongly suggest taking a look at an introductory statistics text, especially the section on family-wise type I error rates, before proceeding.
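The false-positive problem is easy to see in a small simulation. The sketch below assumes 1299 features with no real difference at all (both groups are pure standard-normal noise, 50 rows each, all of which are made-up values) and counts how many uncorrected tests fall below 0.05.

```r
set.seed(1)
n_features <- 1299
# Both groups come from the same distribution, so every null hypothesis is true
cases    <- matrix(rnorm(50 * n_features), ncol = n_features)
controls <- matrix(rnorm(50 * n_features), ncol = n_features)

pvals <- vapply(seq_len(n_features),
                function(i) t.test(cases[, i], controls[, i])$p.value,
                numeric(1))

# Share of "significant" features without any correction (expected near 0.05)
false_pos_rate <- mean(pvals < 0.05)

# Count that survives a family-wise (Holm) correction
n_after_holm <- sum(p.adjust(pvals, method = "holm") < 0.05)

print(c(false_pos_rate = false_pos_rate, n_after_holm = n_after_holm))
```

Every uncorrected "hit" here is a false positive by construction, which is the point the paragraph above is making.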

Perform a Shapiro-Wilk Normality Test

I want to perform a Shapiro-Wilk normality test. My data is in CSV format. It looks like this:
heisenberg
HWWIchg
1 -15.60
2 -21.60
3 -19.50
4 -19.10
5 -20.90
6 -20.70
7 -19.30
8 -18.30
9 -15.10
However, when I perform the test, I get:
shapiro.test(heisenberg)
Error in [.data.frame(x, complete.cases(x)) :
undefined columns selected
Why isn't R selecting the right column, and how do I make it do that?
What does shapiro.test do?
shapiro.test tests the Null hypothesis that "the samples come from a Normal distribution" against the alternative hypothesis "the samples do not come from a Normal distribution".
How to perform shapiro.test in R?
The R help page for ?shapiro.test gives,
x - a numeric vector of data values. Missing values are allowed,
but the number of non-missing values must be between 3 and 5000.
That is, shapiro.test expects a numeric vector as input, that corresponds to the sample you would like to test and it is the only input required. Since you've a data.frame, you'll have to pass the desired column as input to the function as follows:
> shapiro.test(heisenberg$HWWIchg)
# Shapiro-Wilk normality test
# data: heisenberg$HWWIchg
# W = 0.9001, p-value = 0.2528
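A small sketch of the failure and the fix, using a data frame rebuilt by hand from the nine values shown in the question (the real CSV may hold more rows, so the numbers need not match the output above). The exact error message depends on the R version, so the sketch only checks that an error occurs.

```r
# The nine values shown in the question, rebuilt by hand
heisenberg <- data.frame(
  HWWIchg = c(-15.6, -21.6, -19.5, -19.1, -20.9, -20.7, -19.3, -18.3, -15.1)
)

# Passing the whole data frame fails
whole_df <- try(shapiro.test(heisenberg), silent = TRUE)
inherits(whole_df, "try-error")

# Passing the column as a numeric vector works; [[ ]] is equivalent to $ here
res <- shapiro.test(heisenberg[["HWWIchg"]])
res$p.value
```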
Interpreting results from shapiro.test:
First, I strongly suggest you read this excellent answer from Ian Fellows on testing for normality.
As shown above, shapiro.test tests the NULL hypothesis that the samples came from a Normal distribution. This means that if your p-value <= 0.05, then you would reject the NULL hypothesis that the samples came from a Normal distribution. As Ian Fellows nicely put it, you are testing "against the assumption of Normality". In other words (correct me if I am wrong), it would be much better if one tests the NULL hypothesis that the samples do not come from a Normal distribution. Why? Because rejecting a NULL hypothesis is not the same as accepting the alternative hypothesis.
In the case of shapiro.test, a p-value <= 0.05 would reject the null hypothesis that the samples come from a normal distribution. To put it loosely, such a sample would be a rare occurrence if the data really were normal. The side-effect of this kind of hypothesis testing is that the evidence has to be this extreme before normality is rejected. To illustrate, take for example:
set.seed(450)
x <- runif(50, min=2, max=4)
shapiro.test(x)
# Shapiro-Wilk normality test
# data: x
# W = 0.9601, p-value = 0.08995
So, according to this test, normality is not rejected for this (particular) sample, even though it was drawn from a uniform distribution with runif(50, min=2, max=4). What I am trying to say is that there are many, many cases in which the "extreme" requirement (p < 0.05) is not satisfied, which leads to acceptance of the "NULL hypothesis" most of the time, which might be misleading.
Another issue I'd like to quote here, from @PaulHiemstra in the comments, concerns the effect of large sample sizes:
An additional issue with the Shapiro-Wilk's test is that when you feed it more data, the chances of the null hypothesis being rejected becomes larger. So what happens is that for large amounts of data even very small deviations from normality can be detected, leading to rejection of the null hypothesis even though for practical purposes the data is more than normal enough.
Although he also points out that R's data size limit protects against this a bit:
Luckily shapiro.test protects the user from the above described effect by limiting the data size to 5000.
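The sample-size effect can be sketched with simulated data. The example below assumes a mildly heavy-tailed source (a t distribution with 5 degrees of freedom, chosen only for illustration) and runs the test on a small and a large sample from it.

```r
set.seed(123)
# Same non-normal source, two sample sizes
small <- rt(30, df = 5)
large <- rt(5000, df = 5)   # 5000 is the largest n shapiro.test accepts

p_small <- shapiro.test(small)$p.value
p_large <- shapiro.test(large)$p.value

print(c(p_small = p_small, p_large = p_large))
```

With 5000 points the departure from normality is detected essentially every time, while with 30 points the result varies from sample to sample.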
If the NULL hypothesis were the opposite, meaning, the samples do not come from a normal distribution, and you get a p-value < 0.05, then you conclude that it is very rare that these samples do not come from a normal distribution (reject the NULL hypothesis). That loosely translates to: It is highly likely that the samples are normally distributed (although some statisticians may not like this way of interpreting). I believe this is what Ian Fellows also tried to explain in his post. Please correct me if I've gotten something wrong!
@PaulHiemstra also comments about practical situations (for example, regression) in which one comes across this problem of testing for normality:
In practice, if an analysis assumes normality, e.g. lm, I would not do this Shapiro-Wilk's test, but do the analysis and look at diagnostic plots of the outcome of the analysis to judge whether any assumptions of the analysis were violated too much. For linear regression using lm this is done by looking at some of the diagnostic plots you get using plot(lm()). Statistics is not a series of steps that cough up a few numbers (hey p < 0.05!) but requires a lot of experience and skill in judging how to analyze your data correctly.
Here, I find the reply from Ian Fellows to Ben Bolker's comment under the same question already linked above equally (if not more) informative:
For linear regression,
Don't worry much about normality. The CLT takes over quickly and if you have all but the smallest sample sizes and an even remotely reasonable looking histogram you are fine.
Worry about unequal variances (heteroskedasticity). I worry about this to the point of (almost) using HCCM tests by default. A scale location plot will give some idea of whether this is broken, but not always. Also, there is no a priori reason to assume equal variances in most cases.
Outliers. A Cook's distance of > 1 is reasonable cause for concern.
Those are my thoughts (FWIW).
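The diagnostic-plot workflow described in the two quotes above can be sketched as follows. The built-in cars data set and the model dist ~ speed are stand-ins chosen for illustration; they are not from the original discussion.

```r
# Fit a simple linear model on a built-in data set (illustrative only)
fit <- lm(dist ~ speed, data = cars)

# The four standard diagnostic plots: residuals vs fitted, normal Q-Q,
# scale-location (for unequal variances), residuals vs leverage
op <- par(mfrow = c(2, 2))
plot(fit)
par(op)

# Cook's distance per observation; values above 1 deserve a closer look
cd <- cooks.distance(fit)
which(cd > 1)
```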
Hope this clears things up a bit.
You are applying shapiro.test() to a data.frame instead of the column. Try the following:
shapiro.test(heisenberg$HWWIchg)
You failed to specify the exact columns (data) to test for normality.
Use this instead
shapiro.test(heisenberg$HWWIchg)
Set the data as a vector and then place in the function.

sample size for A/B fisher test significance

Given the results for a simple A / B test...
A B
clicked 8 60
ignored 192 1940
(i.e. a conversion rate of 4% for A and 3% for B)
... a fisher test in R quite rightly says there's no significant difference
> fisher.test(data.frame(A=c(8,192), B=c(60,1940)))
...
p-value = 0.3933
...
But what function is available in R to tell me how much I need to increase my sample size to get to a p-value of say 0.05?
I could just increase the A values (in their proportion) until I get to it but there's got to be a better way? Perhaps pwr.2p2n.test [1] is somehow usable?
[1] http://rss.acs.unt.edu/Rdoc/library/pwr/html/pwr.2p2n.test.html
power.prop.test() should do this for you. In order to get the math to work I converted your 'ignored' data to impressions by summing up your columns.
> power.prop.test(p1=8/200, p2=60/2000, power=0.8, sig.level=0.05)
Two-sample comparison of proportions power calculation
n = 5300.739
p1 = 0.04
p2 = 0.03
sig.level = 0.05
power = 0.8
alternative = two.sided
NOTE: n is number in *each* group
That gives 5301 per group, so your total sample size needs to be about 10,600. Subtracting the 2,200 that have already run, you have about 8,400 "tests" to go.
In this case:
sig.level is the significance threshold that your p-value is compared against (0.05 here).
power is the probability of detecting a difference of this size when it really exists. The 80% figure is somewhat arbitrary, but it is a common choice. Note that choosing 80% means that 20% of the time you won't find significance when you should. Increasing the power means you'll need a larger sample size.
If you want to estimate how much longer it will take to reach significance, divide 8400 by the number of impressions per day. That can help determine if it's worthwhile to continue the test.
You can also use this function to determine required sample size before testing begins. There's a nice write-up describing this on the 37 Signals blog.
This is a native R function, so you won't need to add or load any packages. Other than that I can't say how similar this is to pwr.2p2n.test().
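The arithmetic in this answer can be written out as a short sketch. The figure of 500 impressions per day is a made-up value for illustration; everything else comes from the numbers in the question.

```r
res <- power.prop.test(p1 = 8/200, p2 = 60/2000, power = 0.8, sig.level = 0.05)

n_per_group  <- ceiling(res$n)              # round up to whole impressions
total_needed <- 2 * n_per_group             # power.prop.test assumes equal groups
already_run  <- 200 + 2000                  # impressions collected so far
remaining    <- total_needed - already_run

impressions_per_day <- 500                  # hypothetical traffic figure
days_left <- ceiling(remaining / impressions_per_day)

print(c(n_per_group = n_per_group, remaining = remaining, days_left = days_left))
```

Note that the calculation assumes both arms end up the same size, whereas the existing data are split 200 to 2000, so the remaining count is only a rough guide.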
