Chi-squared test in R and Excel - r

I'm working on independence testing for some stuff at work. I usually do this sort of thing in R, but my boss wanted me to do it in Excel for the graphs. My problem is that when I use R's chi-squared test, it gives me a different result from the one Excel gives. I'm not sure if I'm setting things up wrong, or if there's a difference in methods used, but the results are pretty much polar opposites. Are the null hypotheses different in these two programs?
Here's what I've got:
Observed Values:

            Total Errors   Priority 1 + 2   Total
Non-V&T              342              188     530
V&T                  117               64     181
Total                459              252    1422

Expected Values:

            Total Errors   Priority 1 + 2
Non-V&T      171.0759494      93.92405063
V&T          58.42405063      32.07594937

Test value: 2.68619E-79
R:
tbl1 <- matrix(c(342,117,188,64),ncol=2)
chisq.test(tbl1)
Pearson's Chi-squared test with Yates' continuity correction
data: tbl1
X-squared = 1.6653e-30, df = 1, p-value = 1
chisq.test(tbl1)$expected
         [,1]     [,2]
[1,] 342.1519 187.8481
[2,] 116.8481  64.1519
P.S. I can't seem to paste in what I had from Excel properly. The main point is that the p-value and the expected values are different from what R gives me.

I too am not sure how to paste from Excel at the moment, but I can provide the formulas I used in Excel via a screenshot. It produced a p-value of 0.9782, close to that given in R. Please see the screenshot for the values:
[screenshot: Excel worksheet with the observed counts, marginal sums, expected counts, and p-value formula]
In the screenshot:
1. I enter the actual values as input. Cells A2:B3.
2. I compute the marginal row and column sums.
3. I compute the expected cell values by taking the product of the appropriate marginal row and column sums and dividing by the overall sum. Cells A7:B8.
4. I compute the p-value using the actual and expected counts.
If you re-do the R procedure without the Yates correction, i.e. chisq.test(tbl1, correct = FALSE), you get a p-value of 0.9782, which matches Excel's p-value.
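For reference, here is a minimal R sketch of the same manual computation (expected counts from the margins, then the Pearson statistic without the continuity correction), assuming the 2x2 table from the question:

obs <- matrix(c(342, 117, 188, 64), ncol = 2)
# expected count = row sum * column sum / grand total
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)
# Pearson chi-squared statistic, no Yates correction
stat <- sum((obs - expected)^2 / expected)
pchisq(stat, df = 1, lower.tail = FALSE)
# [1] 0.9782..., matching chisq.test(obs, correct = FALSE)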

Related

Simulation to find random sequences

With R I can try to find the probability that the Age vector below resulted from random sampling. I used the runs test (from the randtests package), which resulted in p-value = 0.2892. Other colleagues used the rle function (run length encoding in R) or other approaches to simulate the probability that random allocation would generate the observed sequence. Their result shows p < 0.00000001 that this sequence is the result of random sampling. I am trying to find the R code to replicate their findings. Any help on how to simulate to replicate their findings is highly appreciated.
Update: I received advice from a statistician that I can do this using a non-parametric bootstrap. However, I still do not know how this can be done. I appreciate your help.
example:
Age <- c(68,71,72,69,80,78,80,81,84,82,67,73,65,68,66,70,69,72,74,73,68,75,70,72,75,73,69,75,74,79,80,78,80,81,79,82,69,73,67,66,70,72,69,72,75,80,68,69,71,77,70,73)
randtests::runs.test(Age)
X <- rle(Age); X$lengths
What was initially presented isn't the whole story. If one looks at the supplement where these numbers are from, the reported p-value is for comparing two vectors. OP only provides one, and hence the task is not well-defined.
The full assertion of the research article is that
group1 <- c(68,71,72,69,80,78,80,81,84,82,67,73,65,68,66,70,69,72,74,73,68,75,70,72,75,73)
group2 <- c(69,75,74,79,80,78,80,81,79,82,69,73,67,66,70,72,69,72,75,80,68,69,71,77,70,73)
being two independent random samples has a p-value < 0.00000001.
Even just checking positionwise identity (the original groups match in 10 positions) under permutations within a group, I'm seeing only 2 or 3 draws per million with that many identical values. I.e., something like:
set.seed(123)
mean(replicate(1e6, sum(sample(group1, length(group1)) == group2)) >= 10)
# 2e-06
Testing correlations and/or bootstrapping could easily land in the p-value range that is reported (nothing as extreme in 100 million simulations).
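For illustration, a minimal sketch of a permutation test on the positionwise correlation between the two groups (my own construction, not necessarily the article's exact procedure):

set.seed(123)
obs_cor <- cor(group1, group2)                           # correlation of the paired positions
perm_cor <- replicate(1e5, cor(sample(group1), group2))  # break the pairing within group 1
mean(abs(perm_cor) >= abs(obs_cor))
# if this returns 0, the p-value is below the resolution of 1/1e5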

Fisher's Exact Test

In this post https://stats.stackexchange.com/questions/94909/course-of-action-for-2x2-tables-with-0s-in-cell-and-low-cell-counts, OP said that s/he got a p-value of 0.5152 when conducting a Fisher's exact test for the following data:
    Control  Cases
A         8      0
B        14      0
But I am getting p-value = 1 and odds ratio = 0 for the data. My R code is:
a <- matrix(c(8,14,0,0),2,2)
(res <- fisher.test(a))
Where am I making a mistake?
Good afternoon :)
https://en.wikipedia.org/wiki/Fisher%27s_exact_test
Haven't used these in a while, but I'm assuming it's your column of two 0's:
p <- choose(14, 14) * choose(8, 8) / choose(22, 22)
which is 1.0. For the odds ratio, read here: https://en.wikipedia.org/wiki/Odds_ratio
The 0's end up as either the numerators or the denominators. I think this makes sense, as a column of 0's effectively means you have a group with no observations in it.
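For context, the probability Fisher's test computes here is the hypergeometric probability of the observed table given its margins: for a 2x2 table with entries $a, b, c, d$ and $n = a+b+c+d$, $p = \binom{a+b}{a}\binom{c+d}{c} / \binom{n}{a+c}$. With $a=8, b=0, c=14, d=0$, every binomial coefficient equals 1, hence $p = 1$, as in the computation above.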
You get the strange p-value = 1 and OR = 0 because one of your columns consists entirely of 0's. The table should not be analysed with the chi-square statistic, which is computed cell by cell as

$\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i},$

because the expected counts for the all-zero column are 0 (row total × column total / grand total, with a column total of 0), so those cells yield meaningless 0/0 terms.
Instead, you should use Fisher's exact test (fisher.test()), which to some extent can correct for the very low cell counts (normally you should use Fisher's whenever at least 20% of cells have a count of <5). Source: https://www.ncbi.nlm.nih.gov/pubmed/23894860. Using the chi-square analysis would require you to apply Yates' correction (e.g. chisq.test(matrix, correct = TRUE)).
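A quick sketch reproducing both behaviours on the table above:

a <- matrix(c(8, 14, 0, 0), nrow = 2)
fisher.test(a)  # p-value = 1; the conditional odds-ratio estimate is 0
chisq.test(a)   # X-squared is NaN: both expected counts in column 2 are 0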

ANOVA (AOV function) in R: Misleading p_value reported on equal values

I would greatly appreciate any guidance on the following: I am running ANOVA (aov) to retrieve p-values for a number of subsets of a larger data set. I bumped into a subset where my numeric values are all equal to 36. Because it is part of a loop, ANOVA is still executed and reports a seemingly infinitesimally small p-value of 1.2855e-134. Correct me if I am wrong, but the smaller the p-value, the stronger the evidence that the groups are significantly different?
For simplicity, this is the subset:
[screenshot: sUBSET_FOR_ANOVA, a subset in which every value of the numeric variable equals 36]
Here is how I calculate ANOVA and retrieve p_value, where TEMP_DF2 is just the subset you see attached:
anova_sweep <- aov(GOOD_PTS ~ MACH, data = TEMP_DF2)
# the first element of Pr(>F) is the p-value for the MACH term
p_value <- summary(anova_sweep)[[1]][["Pr(>F)"]][1]
Many thanks for any guidance,
I can't replicate your findings. Let's produce an example dataset with all values being 36:
df <- data.frame(gr = rep(letters[1:2], 100),
                 y = 36)
summary(aov(y ~ gr, data = df))
Gives:
             Df    Sum Sq   Mean Sq F value Pr(>F)
gr            1 1.260e-27 1.262e-27       1  0.319
Residuals   198 2.499e-25 1.262e-27
Basically, depending on the sample size, we obtain a p-value around 0.3 or so. The F statistic is (by definition) always 1, since the between and within group variances are equal.
Are these results misleading? To some extent, yes. The true SS and MS are 0; aov computes them as very, very small values due to floating-point rounding. Some other statistical tests in R and in some packages check for zero variance and would produce an error, but aov apparently does not.
However, more importantly, I would say your data violates the assumptions of the ANOVA, and therefore any result cannot be trusted as a basis for conclusions. The expectation in R when it comes to statistical tests is usually that it is up to the user to employ the tests in the correct circumstances.
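If the loop needs to survive such subsets, one option (my suggestion, reusing the OP's variable names) is to skip zero-variance responses explicitly:

# skip subsets with no variation in the response before calling aov
if (var(TEMP_DF2$GOOD_PTS) == 0) {
  p_value <- NA  # ANOVA is meaningless when every value is identical
} else {
  anova_sweep <- aov(GOOD_PTS ~ MACH, data = TEMP_DF2)
  p_value <- summary(anova_sweep)[[1]][["Pr(>F)"]][1]
}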

Fisher's exact test (R) - simulated p-value does not vary

I have a problem using Fisher's exact test in R with a simulated p-value, but I don't know if it's caused by "the technique" (R) or if it is (statistically) intended to work that way.
One of the datasets I want to work with:
matrix(c(103,0,2,1,0,0,1,0,3,0,0,3,0,0,0,0,0,0,19,3,57,11,2,87,1,2,0,869,4,2,8,1,4,3,18,16,5,60,60,42,1,1,1,1,21,704,40,759,404,151,1491,9,40,144),ncol=2,nrow=27)
The resulting p-value is always the same, no matter how often I repeat the test: p = 1 / (B + 1), where B is the number of replicates used in the Monte Carlo test.
When I shorten the matrix, the simulation works if the number of rows is lower than 19. Nevertheless, it is not a matter of the number of cells in the matrix: after transforming it into a matrix with 3 columns it still does not work, although it does when using the same numbers in just two columns.
Varying simulated p-values:
> a <- matrix(c(103,0,2,1,0,0,1,0,3,0,0,3,0,0,0,0,0,0,869,4,2,8,1,4,3,18,16,5,60,60,42,1,1,1,1,21), ncol = 2, nrow = 18)
> b <- matrix(c(103,0,2,1,0,0,1,0,3,0,0,3,0,0,0,0,0,0,19,869,4,2,8,1,4,3,18,16,5,60,60,42,1,1,1,1,21,704), ncol = 2, nrow = 19)
> c <- matrix(c(103,0,2,1,0,0,1,0,3,0,0,3,0,0,0,0,0,0,869,4,2,8,1,4,3,18,16,5,60,60,42,1,1,1,1,21), ncol = 3, nrow = 12)
> fisher.test(a, simulate.p.value = TRUE)$p.value
The numbers of cells in a and c are the same, but the simulation only produces varying p-values with matrix a.
Does anyone know if it is a statistical issue or an R issue and, if so, how it could be solved?
Thanks for your suggestions
I think that you are just seeing a very significant result. The p-value is computed as the number of simulated matrices (plus the original) that are as extreme or more extreme than the original, divided by the total number of matrices. If none of the randomly generated matrices is as or more extreme, the p-value is just 1 (the original matrix is as extreme as itself) divided by the total number of matrices, which is $B+1$ (the $B$ simulated plus the 1 original matrix). If you run the function with enough samples (a high enough B), you will start to see some of the random matrices come out as or more extreme, and therefore varying p-values, but the time required to do so is probably not reasonable.
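To see the mechanics (a sketch; m is the 27x2 matrix from the question): the reported value tracks $1/(B+1)$ until B is large enough that at least one simulated table is as extreme as the observed one.

m <- matrix(c(103,0,2,1,0,0,1,0,3,0,0,3,0,0,0,0,0,0,19,3,57,11,2,87,1,2,0,
              869,4,2,8,1,4,3,18,16,5,60,60,42,1,1,1,1,21,704,40,759,404,151,1491,9,40,144),
            ncol = 2, nrow = 27)
fisher.test(m, simulate.p.value = TRUE)$p.value           # 1/2001 with the default B = 2000
fisher.test(m, simulate.p.value = TRUE, B = 1e5)$p.value  # most likely still 1/(1e5 + 1)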

Significance testing in R: determining if the proportion in one column is significantly different from the other column within a single variable

I'm sure this is an easy command in R, but for some reason, I'm having trouble finding a solution.
I'm trying to run a bunch of crosstabs (using the table() command) in R, and each tab has two columns (treatment and no treatment). I would like to know, for each row, whether the two columns differ significantly from each other (the rows are a handful of answer choices from a survey). I'm not interested in overall significance, only in the within-crosstab comparison of treatment vs. no treatment.
This type of analysis is very easy in SPSS (link below to illustrate what I'm talking about), but I can't seem to get it working in R. Do you know how I can do this?
http://help.vovici.net/robohelp/robohelp/server/general/projects_fhpro/survey_workbench_MX/Significance_testing.htm
EDITED:
Here is an example in R of what I mean:
treatmentVar <- c(0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1)  # treatment is 1 or 0
question1 <- c(1,2,2,3,1,1,2,2,3,1,1,2,2,3,1,3)     # choices available are 1, 2, or 3
Questiontab <- table(question1, treatmentVar)
Questiontab
I have tables like this (percentaged by column on treatmentVar), and I would like to see if there is a significant difference for each question choice (rows) going from treatment 0 to treatment 1. So in the example above, I would want to know if there is a significant difference between 4 and 2 (row 1), 3 and 3 (row 2), and 1 and 3 (row 3). In this example, the choices for question1 might be significantly different for choices 1 and 3 (because the difference is 2), but not for choice 2 (because the difference is zero). Ultimately, I'm trying to determine this type of significance. I hope that helps.
Thanks!
I think the function you're looking for is pairwise.prop.test(). See ?pairwise.prop.test for an example.
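For instance, a sketch using the OP's table (n = 8 respondents in each treatment arm), comparing the proportion choosing answer 1 across the two arms:

# x = number choosing answer 1 in each arm, n = respondents per arm
pairwise.prop.test(x = c(4, 2), n = c(8, 8))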
Using your example, either the chisq.test or prop.test (equivalent in this case):
> chisq.test(Questiontab)
Pearson's Chi-squared test
data: Questiontab
X-squared = 1.6667, df = 2, p-value = 0.4346
Warning message:
In chisq.test(Questiontab) : Chi-squared approximation may be incorrect
> prop.test(Questiontab)
3-sample test for equality of proportions without continuity
correction
data: Questiontab
X-squared = 1.6667, df = 2, p-value = 0.4346
alternative hypothesis: two.sided
sample estimates:
prop 1 prop 2 prop 3
0.6666667 0.5000000 0.2500000
Warning message:
In prop.test(Questiontab) : Chi-squared approximation may be incorrect
Note the warning; these tests aren't necessarily appropriate for such small numbers.
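If you specifically want the per-row comparisons rather than the overall test, one option (my sketch, not part of the answers above; with counts this small, fisher.test is safer than prop.test) is to test each answer choice separately against the arm totals:

totals <- colSums(Questiontab)  # respondents per treatment arm (8 and 8)
# for each answer choice, compare chose vs. didn't-choose between the two arms
sapply(rownames(Questiontab), function(i) {
  chose <- Questiontab[i, ]
  fisher.test(rbind(chose, totals - chose))$p.value
})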
