Permutation Distribution in R

I need to create a permutation distribution of the difference in proportions for a data set, but I'm not sure of the best way to go about it.
This is the table I need it for. I have to assess whether the difference between 2010 and 2011 is significant for "Yes".
mytable1 <- matrix(c(3648,25843,3407,26134), byrow=TRUE, ncol=2)
dimnames(mytable1) <- list(c("2010","2011"),c("Yes","No"))
names(dimnames(mytable1)) <- c("Year","Response")
How do I code this in a for-loop?
Thank you so much!

Why use a permutation-based test if you can calculate exact probabilities? Is this a homework exercise?
fisher.test(mytable1);
Fisher's Exact Test for Count Data
data: mytable1
p-value = 0.001799
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
1.029882 1.138384
sample estimates:
odds ratio
1.082775
gives you the exact probability (the p-value) of seeing a ratio of "Yes" to "No" in 2010 relative to 2011 (i.e. the odds ratio) as extreme as or more extreme than what was observed, assuming the null hypothesis is true. Note that the null hypothesis corresponds to an odds ratio of 1.
I assume this is what you mean by the "difference between 2010 and 2011 is significant [for Yes]". If not, please clarify and be more precise in specifying your test statistic (and null hypothesis). If it needs to be a permutation-based test, can you show us how far you have gotten?
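If it does need to be a permutation-based test written as a for-loop, a minimal sketch could look like the following. It assumes the table can be expanded back into one record per respondent and that the test statistic is the difference in the proportion of "Yes" between 2010 and 2011; the names year, yes, prop_diff and n_perm are made up for illustration and are not from the question.

```r
set.seed(1)  # only so the sketch is reproducible

mytable1 <- matrix(c(3648, 25843, 3407, 26134), byrow = TRUE, ncol = 2,
                   dimnames = list(Year = c("2010", "2011"),
                                   Response = c("Yes", "No")))

# Expand the 2x2 table into one record per respondent
year <- rep(rownames(mytable1), times = rowSums(mytable1))
yes  <- c(rep(c(TRUE, FALSE), times = mytable1["2010", ]),
          rep(c(TRUE, FALSE), times = mytable1["2011", ]))

# Test statistic: proportion of "Yes" in 2010 minus proportion in 2011
prop_diff <- function(yes, year) {
  mean(yes[year == "2010"]) - mean(yes[year == "2011"])
}
obs <- prop_diff(yes, year)

# Permutation distribution: shuffle the year labels, recompute the statistic
n_perm <- 2000
perm <- numeric(n_perm)
for (i in seq_len(n_perm)) {
  perm[i] <- prop_diff(yes, sample(year))
}

# Two-sided p-value; the +1 counts the observed labelling as one permutation
p_value <- (sum(abs(perm) >= abs(obs)) + 1) / (n_perm + 1)
```

hist(perm) followed by abline(v = obs) shows where the observed difference falls within the permutation distribution. With this many observations the result should be close to what fisher.test() reports, up to Monte Carlo error.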

Related

DHARMa outlier test is significant, what are my next steps?

I'm looking for information and guidance to help me understand the outlier test in DHARMa for negative binomial regression. Here is the diagnostic plot from DHARMa using the function simulateResiduals().
First off, the dispersion test is significant in the plot. Using testDispersion() on the model and on the residuals, I get a value of 2.495. Visually, the dots seem to align pretty well with the QQ line. The developer stated: "If you see a dispersion parameter of 1.01, I would not worry, even if the test is significant. A significant value of 5, however, is clearly a reason to move to a model that accounts for overdispersion." From this I conclude that the deviation is within the acceptable range for the NB regression.
Second, the outlier test is also significant. I have never had this before, and I can't find much information on how many outliers are okay to have. Following the recommendation of DHARMa's developer, I looked at the magnitude of the outliers to investigate this. Here is the code and output:
ModelNB <- glm.nb(BUD ~ Treatment*YEAR, data=Data_Bud)
simulationOutput <- simulateResiduals(fittedModel = ModelNB, plot = T)
testOutliers(simulationOutput, type = "binomial")
DHARMa outlier test based on exact binomial test with approximate expectations
data: simulationOutput
outliers at both margin(s) = 12, observations = 576, p-value = 0.00269
alternative hypothesis: true probability of success is not equal to 0.007968127
95 percent confidence interval:
0.01081011 0.03610864
sample estimates:
frequency of outliers (expected: 0.00796812749003984 )
0.02083333
Can someone help me understand this output? Is having 12 outliers in 576 observations okay? In statistics classes, I was told that removing outliers was a big no-no. What does "true probability of success is not equal to 0.007968127" mean? Does it mean I can't accept H1 and need to accept H0 for the outliers?
Information on my model:
ModelNB <- glm.nb(BUD ~ Treatment*YEAR, data=Data_Bud)
BUD = The number of floral buds on a twig
Treatment = 5 different fertiliser treatments
YEAR = 2 different years (2020 and 2021)
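The numbers in that output can be reproduced with a plain binomial test in base R, which may help in reading it. This is a sketch under one assumption that the question does not state: that simulateResiduals() was run with its default of 250 simulations, in which case the expected outlier frequency for both margins is 2/(250 + 1) = 0.007968127, the value shown in the output. The variable names are illustrative.

```r
n_sim      <- 250   # ASSUMPTION: DHARMa's default number of simulations
n_obs      <- 576   # observations, from the output above
n_outliers <- 12    # outliers at both margins, from the output above

# Expected frequency of observations outside the simulation envelope
expected_rate <- 2 / (n_sim + 1)

# H0: the true outlier frequency equals the expected frequency
bt <- binom.test(n_outliers, n_obs, p = expected_rate)

observed_rate <- unname(bt$estimate)  # 12 / 576
bt$p.value                            # small: more outliers than expected
bt$conf.int                           # interval for the true outlier frequency
```

Read this way, "success" simply means "an observation is an outlier", and the alternative hypothesis says that the true outlier frequency differs from the roughly 0.8% expected under the fitted model. The observed frequency of about 2.1% is higher than that, so the test is flagging more extreme observations than the model predicts; it says nothing about removing them.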

Having computed the relative rejection frequency, how do I test whether it differs significantly from the significance level? (Normality tests in R)

professionals and students,
I have significance levels 10%,5% & 1% and I have computed the relative rejection frequency thanks to an answer on my previous question.
replicate_sw10 = replicate(1000,shapiro.test(rnorm(10)))
table(replicate_sw10["p.value",]<0.10)/1000
> FALSE TRUE
> 0.909 0.091
I have done this for various sample sizes (T=10,30,50,100,500) and stored the results manually in Excel. Maybe there is an easier way to compute this with a function or a list.
However, how do I measure whether the rejection frequency is significantly different from the significance level?
(The hint is the following: the rejection of a test can be modelled as a Bernoulli random variable)
Best regards
The easiest way to do this: if the null hypothesis holds and you perform 1000 tests, you would expect approximately 10% of them to have a p-value < 0.1. Each test is a Bernoulli trial, as you said, so you can use a binomial test to get the probability of a result at least as extreme as yours:
set.seed(100)
replicate_sw10 = replicate(1000,shapiro.test(rnorm(10)))
obs_significant = sum(replicate_sw10["p.value",]<0.1)
binom.test(obs_significant,n=1000,p=0.1)
Exact binomial test
data: obs_significant and 1000
number of successes = 118, number of trials = 1000, p-value = 0.06479
alternative hypothesis: true probability of success is not equal to 0.1
95 percent confidence interval:
0.09865252 0.13962772
sample estimates:
probability of success
0.118
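To avoid storing the results by hand for each sample size, the same idea can be wrapped in a loop over sizes. This is only a sketch: the names sizes, alpha, n_rep and results are made up for illustration, and the 10% level is used as an example.

```r
set.seed(100)

sizes <- c(10, 30, 50, 100, 500)  # sample sizes from the question
alpha <- 0.10                     # nominal significance level
n_rep <- 1000                     # simulated tests per sample size

# Number of rejections of normality at level alpha, for each sample size
rejections <- sapply(sizes, function(n) {
  sum(replicate(n_rep, shapiro.test(rnorm(n))$p.value) < alpha)
})

# Rejection rate and a binomial test of that rate against alpha
results <- data.frame(
  n              = sizes,
  rejection_rate = rejections / n_rep,
  binom_p        = sapply(rejections, function(k) {
    binom.test(k, n_rep, p = alpha)$p.value
  })
)
results
```

Changing alpha to 0.05 or 0.01 repeats the check for the other two levels.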

How to get tabulated interval of Wilcoxon-Mann-Whitney rank sum test

I was reading this topic on Rbloggers about the use of the Wilcoxon rank sum test: https://www.r-bloggers.com/wilcoxon-mann-whitney-rank-sum-test-or-test-u/
Especially this part, here I quote:
"We can finally compare the intervals tabulated on the tables of Wilcoxon for independent samples. The tabulated interval for two groups of 6 samples each is (26, 52)".
How can I get these "tabulated" values ?
I understand they used a table where the values are reported following the size of each samples, but I was wondering if there was a way to get them in R.
It is important because as I can understand the post, once you have a p-value > 0.05 and so cannot reject the null hypothesis H0, you can actually confirm H0 by comparing "computed" and "tabulated" intervals.
So what I would need is the tabulated intervals, using R.
tl;dr
You can get confidence intervals for a Mann-Whitney-Wilcoxon test by specifying conf.int=TRUE.
Don't believe everything you read on the internet ...
If by "confirm" you mean "make sure that the computation is true", you don't need to double-check by consulting the original tables; the p-value should be enough to decide whether you can reject H0 or not. You can trust R for standard, widely used statistical methods. (I also show below how to repeat the computation with a different implementation from the coin package, which is a nearly independent check.)
If by "confirm" you mean "accept the null hypothesis", please don't do this; this is a fundamental violation of frequentist statistical theory, which says that you can reject a null hypothesis, but that you can never accept the null. Wide confidence intervals and p-values greater than a given threshold are evidence that the conclusion is uncertain (we can't be sure whether the null or the alternative is true), not that the null is true. The concluding text of the blog post referred to ("we conclude by accepting the hypothesis H0 of equality of means") is statistically incorrect.
A better way to interpret the uncertainty is to look at the confidence intervals. You can compute these for the Wilcoxon test: from ?wilcox.test:
... (if argument ‘conf.int’ is true [and a two-sample test is being performed]), a nonparametric
confidence interval and an estimator for ... the difference of the location parameters
‘x-y’ is computed.
> a = c(6, 8, 2, 4, 4, 5)
> b = c(7, 10, 4, 3, 5, 6)
> wilcox.test(b,a, conf.int=TRUE, correct=FALSE)
data: b and a
W = 22, p-value = 0.5174
alternative hypothesis: true location shift is not equal to 0
95 percent confidence interval:
-1.999975 4.000016
sample estimates:
difference in location
0.9999395
The high p-value (0.5174) says that we really can't tell whether the values in a or b have significantly different ranks. The difference in location is the estimated shift between the two groups on the original scale of the data (per ?wilcox.test, the median of the differences between a value from b and a value from a, not the difference in medians or in ranks), and the confidence interval is the confidence interval on this shift. In this case, for a total sample size of 12, the estimated shift is 1 (values in b tend to be slightly higher than values in a), and the confidence interval is (-2, 4) (the data are consistent with group b being somewhat lower or considerably higher than group a). It is admittedly rather difficult to interpret the substantive meaning of these values - that's one of the disadvantages of rank-based nonparametric tests ...
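As a quick check on what the "difference in location" estimate measures, here is a small sketch that computes the median of all pairwise differences between b and a directly. The names pairwise_diffs and shift_estimate are illustrative; the normal-approximation estimate printed by wilcox.test() above (0.9999395) should agree with this value up to numerical tolerance.

```r
a <- c(6, 8, 2, 4, 4, 5)
b <- c(7, 10, 4, 3, 5, 6)

# All 36 differences between a value in b and a value in a
pairwise_diffs <- outer(b, a, "-")

# Median of the pairwise differences, on the scale of the data
shift_estimate <- median(pairwise_diffs)
shift_estimate
```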
You can assume that the p-value computed by wilcox.test() is a reasonable summary of the evidence against the null hypothesis; there's no need to look up ranges in the tables. If you're worried about wilcox.test() in base R, you can try wilcox_test() from the coin package:
library(coin)
dd <- data.frame(f=factor(rep(c("a","b"),each=6)),x=c(a,b)) ## coin needs the grouping variable to be a factor
wilcox_test(x~f,data=dd,conf.int=TRUE) ## asymptotic test
which gives nearly identical results to wilcox.test(), and
wilcox_test(x~f,data=dd,conf.int=TRUE, distribution="exact")
which gives a slightly different p-value, but essentially the same confidence intervals.
of historical interest only
As for the tables: I found them on Google books, by doing a Google Scholar search with author:katti author:wilcox. There you can read the description of how they were computed; this wouldn't be impossible to replicate, but it seems unnecessary since p-values and confidence intervals are available via other methods. Digging through you find this:
The number 0.0206 in the red box indicates that the interval (26,52) corresponds to a one-tail p-value of 0.0206 (2-tailed = 0.0412); that's the closest you can get with a discrete range. The next closest range is given in the line below [(27,51), one-tailed p=0.0325, two-tailed=0.065]. In the 21st century you should never have to do this procedure.
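For anyone who does want the tabulated bounds from R, here is a sketch that uses base R's pwilcox(). It assumes the table is expressed in terms of the rank sum of one group, whereas pwilcox() works with the Mann-Whitney statistic U (the rank sum minus m(m+1)/2); the variable names are illustrative.

```r
m <- 6  # size of the first group
n <- 6  # size of the second group

# Largest U whose lower-tail probability stays at or below 2.5%
u_grid <- 0:(m * n)
u_low  <- max(u_grid[pwilcox(u_grid, m, n) <= 0.025])

# Convert U to the rank-sum scale (ASSUMPTION: the table uses rank sums)
offset <- m * (m + 1) / 2
lower  <- u_low + offset
upper  <- (m * n - u_low) + offset

# One-tailed probability attached to that interval
one_tail <- pwilcox(u_low, m, n)

c(lower = lower, upper = upper, one_tail = one_tail)
```

For m = n = 6 this gives the interval (26, 52) with a one-tailed probability of about 0.0206, which matches the entry described above.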

How to calculate p-values for each feature in R using two sample t-test

I have two data frames, cases and controls, and I performed a two-sample t-test as shown below. But I am doing feature extraction from a feature set of 1299 features (columns), so I want to calculate a p-value for each feature. Based on the p-value generated for each feature, I want to decide whether or not to reject the null hypothesis.
Can anyone explain to me how the below output is interpreted and how to calculate the p-values for each feature?
t.test(New_data_zero,New_data_one)
Welch Two Sample t-test
data: New_data_zero_pca and New_data_one_pca
t = -29.086, df = 182840000, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.02499162 -0.02183612
sample estimates:
mean of x mean of y
0.04553462 0.06894849
Look at ?t.test. x and y are supposed to be vectors, not matrices, so the function is automatically converting them to vectors. What you want to do, assuming that columns are features and the two matrices have the same features, is:
# one p-value per feature (column)
pvals <- numeric(ncol(New_data_zero))
for (i in seq_len(ncol(New_data_zero))) {
  pvals[i] <- t.test(New_data_zero[, i], New_data_one[, i])$p.value
}
Then you can look at pvals (probably on a log scale), ideally after applying a multiple hypothesis testing correction (see ?p.adjust).
Let's also address why this approach to finding differences among your features is an enormously bad idea if left uncorrected. Even if the true difference is literally zero for every one of these 1299 features, you should expect about 5% of the 1299 tests (roughly 65) to come out significant at the 0.05 level, which makes the raw p-values effectively meaningless as a screening tool. I would strongly suggest taking a look at an introductory statistics text, especially the section on family-wise type I error rates, before proceeding.
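To make the multiple-testing point concrete, here is a small simulation sketch. The data are simulated pure noise, not the poster's data, and the names cases, controls, n_per_group, raw_hits and adj_hits are hypothetical; the Benjamini-Hochberg method is just one of the corrections that p.adjust() offers.

```r
set.seed(42)

n_features  <- 1299  # number of features, as in the question
n_per_group <- 50    # ASSUMPTION: arbitrary group size for the simulation

# Simulated data with no true difference for any feature
cases    <- matrix(rnorm(n_per_group * n_features), ncol = n_features)
controls <- matrix(rnorm(n_per_group * n_features), ncol = n_features)

# One Welch t-test p-value per feature
pvals <- vapply(seq_len(n_features), function(i) {
  t.test(cases[, i], controls[, i])$p.value
}, numeric(1))

raw_hits <- sum(pvals < 0.05)             # false positives before correction

adj      <- p.adjust(pvals, method = "BH")
adj_hits <- sum(adj < 0.05)               # what survives the correction

c(raw = raw_hits, adjusted = adj_hits)
```

Because every feature here is noise, anything counted in raw_hits is a false positive; the adjusted count can never be larger and is typically at or near zero.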

Interpreting var.test results in R

I am trying to learn the F test. On running the built-in var.test() in R, I obtained the following result:
var.test(gardenB,gardenC)
F test to compare two variances
data: gardenB and gardenC
F = 0.09375, num df = 9, denom df = 9, p-value = 0.001624
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
0.02328617 0.37743695
sample estimates:
ratio of variances
0.09375
I understand that based on the p-value, I should reject the Null hypothesis.
However, I am unable to understand the meaning conveyed by the 95 percent confidence interval?
I tried reading through the explanation provided for the queries:
https://stats.stackexchange.com/questions/31454/how-to-interpret-the-confidence-interval-of-a-variance-f-test-using-r
But I am still unable to understand the meaning conveyed by the confidence interval. Any help would be really appreciated.
Sorry, I know this is an old post but it showed up as the second result on google so I will try to answer the question still.
The confidence interval is for the RATIO of the two variances.
For example, if the variances are equal, i.e. var1 = var2, the ratio var1/var2 is 1.
var.test() is usually used to test whether the variances are equal. If 1 is not in the 95% confidence interval, you can reject the null hypothesis of equal variances at the 5% level. Here the interval (0.023, 0.377) excludes 1, which agrees with the p-value of 0.001624.
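To see where the interval comes from, here is a sketch that rebuilds it from the F statistic and the degrees of freedom printed in the question, using quantiles of the F distribution. The gardenB and gardenC data are not shown in the question, so only the reported numbers are used; the variable names are illustrative.

```r
F_obs <- 0.09375  # ratio of sample variances, from the output
df1   <- 9        # numerator degrees of freedom
df2   <- 9        # denominator degrees of freedom

# 95% confidence interval for the true ratio of variances
ci <- c(lower = F_obs / qf(0.975, df1, df2),
        upper = F_obs / qf(0.025, df1, df2))

# Two-sided p-value for H0: ratio of variances = 1
p_value <- 2 * min(pf(F_obs, df1, df2), 1 - pf(F_obs, df1, df2))

ci
p_value
```

The whole interval lies below 1, so the data suggest that the variance of gardenB is somewhere between about 2% and 38% of the variance of gardenC.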
