Pairwise comparisons with cell values less than 1 - r

This is an example modified from https://rcompanion.org/rcompanion/b_05.html. I have a 4 by 2 matrix with cell values less than 1. I would like to compare pairwise differences between supplements, and to show that Selenium is different when compared with Vitamin E and Selenium+E. The current code shows all 1's for the p-values.
I would like to modify the current code to perform a test that accounts for the differences when the values are proportions less than 1.
Input =("
Supplement No.cancer Cancer
'Selenium' 0.894 0.776
'Vitamin E' 0.973 0.916
'Selenium+E' 0.988 0.936
'Placebo' 0.967 0.847
")
Matriz = as.matrix(read.table(textConnection(Input),
header=TRUE,
row.names=1))
Matriz
library(rcompanion) # provides pairwiseNominalIndependence()
pairwiseNominalIndependence(Matriz,
fisher = FALSE,
gtest = FALSE,
chisq = TRUE,
method = "fdr")

You can't use a chi-square test of association unless you have a table of (whole) counts. But some effect size statistics that are often applied to tables of counts might be applicable. For example, you could look at the odds ratio across pairwise rows.
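As a rough sketch of the odds-ratio idea, assuming the two columns can be read as proportions for which a cross-product ratio is meaningful (the question does not confirm this), each pair of rows forms a 2 by 2 sub-table whose cross-product ratio can be computed directly in base R. The object names below are illustrative, and the result is purely descriptive: no p-value comes out of it.

```r
# Rebuild the matrix from the question (values copied from the post)
Matriz <- matrix(c(0.894, 0.776,
                   0.973, 0.916,
                   0.988, 0.936,
                   0.967, 0.847),
                 ncol = 2, byrow = TRUE,
                 dimnames = list(c("Selenium", "Vitamin E", "Selenium+E", "Placebo"),
                                 c("No.cancer", "Cancer")))

# All unordered pairs of rows (one pair per column)
row_pairs <- combn(rownames(Matriz), 2)

# Cross-product ratio of the 2 x 2 sub-table formed by each pair of rows
odds_ratios <- apply(row_pairs, 2, function(p) {
  (Matriz[p[1], "No.cancer"] * Matriz[p[2], "Cancer"]) /
    (Matriz[p[1], "Cancer"] * Matriz[p[2], "No.cancer"])
})
names(odds_ratios) <- apply(row_pairs, 2, paste, collapse = " vs ")

print(round(odds_ratios, 3))
```

A ratio near 1 suggests the two rows have a similar column pattern; how far from 1 counts as "different" cannot be judged without the underlying counts.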

Related

Am I following the correct procedures with the dunn.test function?

I tested for differences in abundance among sampling sites using kruskal.test. Now I want to determine which pairs of sites differ from each other.
The dunn.test function can take a data vector together with a categorical (grouping) vector, or a formula expression as in lm.
I wrote the call so that it works on a data frame with many columns, but I have not found an example that confirms my procedure.
library(dunn.test)
df<-data.frame(a=runif(5,1,20),b=runif(5,1,20), c=runif(5,1,20))
kruskal.test(df)
dunn.test(df)
My results were:
Kruskal-Wallis chi-squared = 6.02, df = 2, p-value = 0.04929
Kruskal-Wallis chi-squared = 6.02, df = 2, p-value = 0.05
Comparison of df by group
Between 1 and 2 2.050609, 0.0202
Between 1 and 3 -0.141421, 0.4438
Between 2 and 3 -2.192031, 0.0142
I took a look at your code, and you are close. One issue is that you should be specifying a method to correct for multiple comparisons, using the method argument.
Correcting for Multiple Comparisons
For your example data, I'll use the Benjamini-Yekutieli variant of the False Discovery Rate (FDR). The reasons why I think this is a good performer for your data are beyond the scope of StackOverflow, but you can read more about it and other correction methods here. I also suggest you read the associated papers; most of them are open-access.
library(dunn.test)
set.seed(711) # set pseudorandom seed
df <- data.frame(a = runif(5,1,20),
b = runif(5,1,20),
c = runif(5,1,20))
dunn.test(df, method = "by") # correct for multiple comparisons using "B-Y" procedure
# Output
data: df and group
Kruskal-Wallis chi-squared = 3.62, df = 2, p-value = 0.16
Comparison of df by group
(Benjamini-Yekuteili)
Col Mean-|
Row Mean | 1 2
---------+----------------------
2 | 0.494974
| 0.5689
|
3 | -1.343502 -1.838477
| 0.2463 0.1815
alpha = 0.05
Reject Ho if p <= alpha/2
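To see what a Benjamini-Yekutieli correction does to a set of p-values, base R's p.adjust offers the same procedure under method = "BY". The sketch below applies it to the three unadjusted p-values reported in the question. It is only an illustration of the adjustment, not a reproduction of dunn.test's internals.

```r
# Unadjusted pairwise p-values copied from the question's dunn.test output
p_raw <- c("1 vs 2" = 0.0202, "1 vs 3" = 0.4438, "2 vs 3" = 0.0142)

# Benjamini-Hochberg and Benjamini-Yekutieli adjustments from base R
p_bh <- p.adjust(p_raw, method = "BH")
p_by <- p.adjust(p_raw, method = "BY")

# BY is BH inflated by the harmonic sum 1 + 1/2 + ... + 1/m (capped at 1)
harmonic <- sum(1 / seq_along(p_raw))

print(round(cbind(raw = p_raw, BH = p_bh, BY = p_by), 4))
```

The extra harmonic factor is what makes the B-Y variant more conservative than plain B-H.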
Interpreting the Results
The first row in each cell provides the Dunn's pairwise z test statistic for each comparison, and the second row provides your corrected p-values.
Notice that, once corrected for multiple comparisons, none of your pairwise tests are significant at an alpha of 0.05, which is not surprising given that each of your example "sites" was generated by exactly the same distribution. I hope this has been helpful. Happy analyzing!
P.S. In the future, you should use set.seed() if you're going to construct example dataframes using runif (or any other kind of pseudorandom number generation). Also, if you have other questions about statistical analysis, it's better to ask at: https://stats.stackexchange.com/
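For the vector-plus-grouping form mentioned in the question, a wide data frame can be reshaped with base R's stack(). This minimal sketch checks that the Kruskal-Wallis statistic is the same either way. The dunn.test call is guarded so the sketch still runs if the package is not installed.

```r
set.seed(711)  # same seed as in the answer above
df <- data.frame(a = runif(5, 1, 20),
                 b = runif(5, 1, 20),
                 c = runif(5, 1, 20))

# stack() returns two columns: 'values' (numeric) and 'ind' (factor a/b/c)
long <- stack(df)

# Kruskal-Wallis on the wide data frame and on the long form
kw_wide <- kruskal.test(df)
kw_long <- kruskal.test(values ~ ind, data = long)
print(c(wide = unname(kw_wide$statistic), long = unname(kw_long$statistic)))

# Dunn's test with an explicit data vector and grouping vector
if (requireNamespace("dunn.test", quietly = TRUE)) {
  dunn.test::dunn.test(long$values, long$ind, method = "by")
}
```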

Letters group Games-Howell post hoc in R

I am using the sweetpotato dataset included in the R package agricolae:
data(sweetpotato)
This dataset contains two variables: yield (a continuous variable) and virus (a factor).
Because the Levene test is significant, I cannot assume homogeneity of variances, so I apply the Welch test in R instead of a one-way ANOVA followed by a Tukey post hoc test.
The problem arises when I apply the post hoc test. For the Tukey post hoc test I use library(agricolae), which displays the superscript letters for the virus groups, so there is no problem there.
To perform the Games-Howell post hoc test, however, I use library(userfriendlyscience). I obtain the Games-Howell output, but I cannot get the letter-based comparison between virus groups that library(agricolae) provides.
The code I used was the following:
library(agricolae) # provides the sweetpotato dataset
library(userfriendlyscience)
data(sweetpotato)
oneway<-oneway(sweetpotato$virus, y=sweetpotato$yield, posthoc = 'games-howell')
oneway
I tried cld() after loading library(multcompView), but it doesn't work.
Could somebody help me?
Thanks in advance.
This functionality does not exist in userfriendlyscience at the moment. You can see which means differ, and with which p-values, by looking at the row names of the dataframe with the post-hoc test results. I'm not sure which package contains the sweetpotato dataset, but using the ChickWeight dataset that comes with R (and is used on the oneway manual page):
oneway(y=ChickWeight$weight, x=ChickWeight$Diet, posthoc='games-howell');
Yields:
### (First bit removed as it's not relevant.)
### Post hoc test: games-howell
diff ci.lo ci.hi t df p
2-1 19.97 0.36 39.58 2.64 201.38 .044
3-1 40.30 17.54 63.07 4.59 175.92 <.001
4-1 32.62 13.45 51.78 4.41 203.16 <.001
3-2 20.33 -6.20 46.87 1.98 229.94 .197
4-2 12.65 -10.91 36.20 1.39 235.88 .507
4-3 -7.69 -33.90 18.52 0.76 226.16 .873
The first three rows compare groups 2, 3 and 4 to 1: using alpha = .05, all three have higher means than group 1, although the difference between 1 and 2 is borderline (p = .044). This allows you to compute the logical vector you need for multcompLetters in multcompView. Based on the example from the manual page at ?multcompView:
### Run oneway anova and store result in object 'res'
res <- oneway(y=ChickWeight$weight, x=ChickWeight$Diet, posthoc='games-howell');
### Extract dataframe with post hoc test results,
### and overwrite object 'res'
res <- res$intermediate$posthoc;
### Extract p-values and comparison 'names'
pValues <- res$p;
### Create logical vector, assuming alpha of .05
dif3 <- pValues > .05;
### Assign names (row names of post hoc test dataframe)
names(dif3) <- row.names(res);
### convert this vector to the letters to compare
### the group means (see `?multcompView` for the
### references for the algorithm):
multcompLetters(dif3);
This yields as final result:
2 3 4 1
"a" "b" "c" "abc"
This is what you need, right?
I added this functionality to userfriendlyscience, but it will be a while before this new version will be on CRAN. In the meantime, you can get the source code for this update at https://github.com/Matherion/userfriendlyscience/blob/master/R/oneway.R if you want (press the 'raw' button to get an easy-to-download version of the source code).
Note that if you need this updated version, you need to set parameter posthocLetters to TRUE, because it's FALSE by default. For example:
oneway(y=ChickWeight$weight,
x=ChickWeight$Diet,
posthoc='games-howell',
posthocLetters=TRUE);
Shouldn't it be dif3 <- pValues < .05 instead of dif3 <- pValues > .05?
This way the letters are the same if the distributions are 'the same' (that is, there is no evidence that they are different).
Please correct me if I'm interpreting this wrong.
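As far as I can tell from ?multcompLetters, a logical input is read as TRUE meaning "significantly different", which would support the comment above. Below is a minimal sketch using the p-values printed in the Games-Howell table. The two "<.001" entries are entered as 0.001, an upper bound chosen for illustration rather than the exact values. The multcompView call is guarded in case the package is not installed.

```r
# p-values copied from the post hoc table; "<.001" entered as 0.001 (upper bound)
p_values <- c("2-1" = 0.044, "3-1" = 0.001, "4-1" = 0.001,
              "3-2" = 0.197, "4-2" = 0.507, "4-3" = 0.873)

# TRUE marks pairs that differ at alpha = .05
significant <- p_values < 0.05
print(significant)

# Convert to compact letter display if multcompView is available
if (requireNamespace("multcompView", quietly = TRUE)) {
  print(multcompView::multcompLetters(significant))
}
```

With this orientation, groups that share a letter are the ones with no evidence of a difference.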

Confusing p values with ANOVA on a big dataframe

I am trying to test for significant differences in car company performance scores across different countries. I am using ANOVA to do this.
Running ANOVA on my real dataset (30 countries, 1000 car companies and 90000 measurement scores) gave every car a p-value of zero.
Confused by this, I created a reproducible example (below) with 30 groups, 3 car companies and 90000 random scores. I deliberately gave the Benz company a constant score of 1, so there should be no difference between countries. After running the ANOVA, I see a p-value of 0.46 instead of 1.
Does anyone know why this is?
Reproducible example
set.seed(100000)
qqq <- 90000
df = data.frame(id = 1:qqq,
  country = rep(paste0("us", c(letters, "aa", "ab", "ac", "ad")), each = 3000), # 30 countries, 3000 rows each
  tesla = runif(qqq),
  bmw = runif(qqq),
  benz = rep(1, qqq)) # constant score for every country
str(df)
out<-data.frame()
for(j in 3:ncol(df)){
amod2 <- aov(df[,j]~df$country)
out[(j-2),1]<-colnames(df)[j]
# summary.aov() silently ignores a test= argument, so these p-values are unadjusted
out[(j-2),2]<-summary(amod2)[[1]][[1,"Pr(>F)"]]
}
colnames(out)<-c("cars","pvalue")
write.table(out,"df.output")
df.output
"cars" "pvalue"
"1" "tesla" 0.245931589754359
"2" "bmw" 0.382730335188437
"3" "benz" 0.465083026215268
With respect to the "benz" p-value in your reproducible example: ANOVA requires positive variance (i.e., non-constant data). If you violate this assumption, the model is degenerate. Technically, the p-value is based on an F-statistic that is the ratio of the mean square attributable to the "country" effect (for "benz" in your example, zero) to the residual mean square (for "benz" in your example, also zero), so your F-statistic has "value" 0/0, or NaN.
Because of the approach R takes to calculating the F-statistic (using a QR matrix decomposition to improve numerical stability in "nearly" degenerate cases), it calculates an F-statistic equal to 1 (w/ 29 and 89970 degrees of freedom). This gives a p-value of:
> pf(1, 29, 89970, lower=FALSE)
[1] 0.465083
>
but it is, of course, largely meaningless.
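The degrees of freedom and the resulting p-value can be reproduced from the design alone. A small sketch with illustrative variable names:

```r
n_groups <- 30      # countries in the reproducible example
n_obs    <- 90000   # total rows

df_between <- n_groups - 1       # 29
df_within  <- n_obs - n_groups   # 89970

# Upper-tail probability of F = 1 with these degrees of freedom
p_benz <- pf(1, df_between, df_within, lower.tail = FALSE)
print(p_benz)
```

This is the 0.465 reported for "benz": it comes from the F value of 1 together with the degrees of freedom, not from any pattern in the constant data.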
With respect to your original problem, with large datasets relatively small effects will yield very small p-values. For example, if you add the following after your df definition above to introduce a difference in country usa:
df = within(df, {
o = country=="usa"
tesla[o] = tesla[o] + .1
bmw[o] = bmw[o] + .1
benz[o] = benz[o] + .1
rm(o)
})
you will find that out looks like this:
> out
cars pvalue
1 tesla 9.922166e-74
2 bmw 5.143542e-74
3 benz 0.000000e+00
>
Is this what you're seeing, or are you seeing all of them exactly zero?
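The sample-size effect can be seen in isolation with a small simulation. This is a sketch under simplified assumptions (two groups instead of 30, uniform scores, and a fixed shift of 0.1 in one group); p_for_n is a hypothetical helper, not something from the question.

```r
set.seed(1)

# One-way ANOVA p-value for two groups of uniform scores, one shifted by 'shift'
p_for_n <- function(n_per_group, shift = 0.1) {
  y <- c(runif(n_per_group), runif(n_per_group) + shift)
  g <- factor(rep(c("a", "b"), each = n_per_group))
  summary(aov(y ~ g))[[1]][1, "Pr(>F)"]
}

sizes <- c(30, 300, 3000, 30000)
p_by_size <- sapply(sizes, p_for_n)

print(data.frame(n_per_group = sizes, p_value = p_by_size))
```

The shift is identical in every run; only the number of observations changes, yet the p-value collapses toward zero as n grows.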

How to exclude unwanted comparisons in two-way ANOVA in R

I have already asked about this on stats.stackexchange (original question); I am re-posting the same content here in the hope of getting help from a wider audience.
I would like to know how to exclude all the unwanted pairs from the output of a two-way ANOVA, so that when summary(aov()) shows a significant result, the post-hoc test doesn't give me any comparisons I don't want. Details as follows:
I have datTable, which contains proportion data under two factors: site (four levels: A, B, C, D) and treatment (two levels: control and treated). Specifically, I want to do pairwise tests among all the sites within the same treatment (e.g. control-A vs. control-B, control-A vs. control-C, treated-A vs. treated-C, etc.), while excluding comparisons that differ in both site and treatment (e.g. pairs such as control-A vs. treated-B, control-B vs. treated-C).
The data looks like this:
> datTable
site treatment proportion
A control 0.5000000
A control 0.4444444
A treated 0.1000000
A treated 0.4000000
B control 0.4444444
B control 0.4782609
B treated 0.0500000
B treated 0.3000000
C control 0.3214286
C control 0.4705882
C treated 0.1200000
C treated 0.4000000
D control 0.3928571
D control 0.4782609
D treated 0.4000000
D treated 0.4100000
I did a two-way ANOVA (also not sure whether to use within subject site/treatment or between subject site*treatment...), and summarised the results.
m1 <- aov(proportion~site*treatment,data=datTable) # Or should I use 'site/treatment'?
Then my summary(m1) gave me the following:
> summary(m1)
Df Sum Sq Mean Sq F value Pr(>F)
site 3 0.02548 0.00849 0.513 0.6845
treatment 1 0.11395 0.11395 6.886 0.0305 *
site:treatment 3 0.03686 0.01229 0.742 0.5561
Residuals 8 0.13239 0.01655
The next step is to use the TukeyHSD post-hoc test to see which pairs actually caused the significance (*) in the site factor.
> TukeyHSD(m1)
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = proportion ~ site * treatment, data = datTable)
$site
diff lwr upr p adj
B-A -0.042934783 -0.3342280 0.2483585 0.9631797
C-A -0.033106909 -0.3244002 0.2581863 0.9823452
D-A 0.059168392 -0.2321249 0.3504616 0.9124774
C-B 0.009827873 -0.2814654 0.3011211 0.9995090
D-B 0.102103175 -0.1891901 0.3933964 0.6869754
D-C 0.092275301 -0.1990179 0.3835685 0.7461309
$treatment
diff lwr upr p adj
treated-control -0.1687856 -0.3171079 -0.02046328 0.0304535
$`site:treatment`
diff lwr upr p adj
B:control-A:control -0.010869565 -0.5199109 0.4981718 1.0000000
C:control-A:control -0.076213819 -0.5852551 0.4328275 0.9979611
D:control-A:control -0.036663216 -0.5457045 0.4723781 0.9999828
A:treated-A:control -0.222222222 -0.7312635 0.2868191 0.6749021
B:treated-A:control -0.297222222 -0.8062635 0.2118191 0.3863364 # Not wanted
C:treated-A:control -0.212222222 -0.7212635 0.2968191 0.7154690 # Not wanted
D:treated-A:control -0.067222222 -0.5762635 0.4418191 0.9990671 # Not wanted
C:control-B:control -0.065344254 -0.5743856 0.4436971 0.9992203
D:control-B:control -0.025793651 -0.5348350 0.4832477 0.9999985
A:treated-B:control -0.211352657 -0.7203940 0.2976887 0.7189552 # Not wanted
B:treated-B:control -0.286352657 -0.7953940 0.2226887 0.4233804 # Not wanted
C:treated-B:control -0.201352657 -0.7103940 0.3076887 0.7583437 # Not wanted
D:treated-B:control -0.056352657 -0.5653940 0.4526887 0.9996991
D:control-C:control 0.039550603 -0.4694907 0.5485919 0.9999713
A:treated-C:control -0.146008403 -0.6550497 0.3630329 0.9304819 # Not wanted
B:treated-C:control -0.221008403 -0.7300497 0.2880329 0.6798628 # Not wanted
C:treated-C:control -0.136008403 -0.6450497 0.3730329 0.9499131
D:treated-C:control 0.008991597 -0.5000497 0.5180329 1.0000000 # Not wanted
A:treated-D:control -0.185559006 -0.6946003 0.3234823 0.8168230 # Not wanted
B:treated-D:control -0.260559006 -0.7696003 0.2484823 0.5194129 # Not wanted
C:treated-D:control -0.175559006 -0.6846003 0.3334823 0.8505865 # Not wanted
D:treated-D:control -0.030559006 -0.5396003 0.4784823 0.9999950
B:treated-A:treated -0.075000000 -0.5840413 0.4340413 0.9981528
C:treated-A:treated 0.010000000 -0.4990413 0.5190413 1.0000000
D:treated-A:treated 0.155000000 -0.3540413 0.6640413 0.9096378
C:treated-B:treated 0.085000000 -0.4240413 0.5940413 0.9960560
D:treated-B:treated 0.230000000 -0.2790413 0.7390413 0.6429921
D:treated-C:treated 0.145000000 -0.3640413 0.6540413 0.9326207
However, there are some pairs that I don't want included in the two-way ANOVA I performed; these are marked # Not wanted.
Is there any way I can tweak the aov or TukeyHSD function to exclude the 'not wanted' comparisons listed above? I could easily pick out the significant entries I am interested in (marked with *) from the long list produced by TukeyHSD, but I don't want my ANOVA result to be biased by the others! (In the real data, the significance is actually caused by those unwanted pairs!)
NB: You might have noticed that the site:treatment post-hoc tests don't show any significance; this is because I only selected a small sample from the original data.
If you mean to exclude those comparisons completely from the calculations, Tukey's test works by doing pairwise comparisons for all combinations of conditions. It doesn't make sense to "exclude" any pairs.
If you mean you want to exclude the unwanted comparisons from showing in your final results then yes, it is possible. The result of TukeyHSD is simply a list and site:treatment is simply a matrix which you can manipulate as you like.
lst <- TukeyHSD(m1)
lst[['site:treatment']] <- lst[['site:treatment']][-c(5,6,7,10,11,12,15,16,18,19,20,21),]
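As an alternative to hard-coded row indices, the rows can be selected by name. The sketch below rebuilds the data from the question and keeps only comparisons where both cells share the same treatment. That is one reading of "wanted" (the markers in the question are not fully consistent with it), so adjust the rule as needed. treat_of is a hypothetical helper.

```r
# Data copied from the question; factors are created explicitly
datTable <- data.frame(
  site = factor(rep(c("A", "B", "C", "D"), each = 4)),
  treatment = factor(rep(rep(c("control", "treated"), each = 2), times = 4)),
  proportion = c(0.5000000, 0.4444444, 0.10, 0.40,
                 0.4444444, 0.4782609, 0.05, 0.30,
                 0.3214286, 0.4705882, 0.12, 0.40,
                 0.3928571, 0.4782609, 0.40, 0.41)
)

m1  <- aov(proportion ~ site * treatment, data = datTable)
tuk <- TukeyHSD(m1)[["site:treatment"]]

# Row names look like "B:control-A:control"; split each into its two cells
sides <- strsplit(rownames(tuk), "-", fixed = TRUE)

# Extract the treatment part (after the colon) of a cell label
treat_of <- function(cell) sub("^.*:", "", cell)

same_treatment <- vapply(sides,
                         function(s) treat_of(s[1]) == treat_of(s[2]),
                         logical(1))

within_treatment <- tuk[same_treatment, , drop = FALSE]
print(within_treatment)
```

Note that the adjusted p-values are still those computed over all 28 comparisons; filtering only changes what is displayed.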

Multinomial logit models and nested logit models

I am using the mlogit package in program R. I have converted my data from its original wide format to long format. Here is a sample of the converted data.frame which I refer to as 'long_perp'. All of the independent variables are individual specific. I have 4258 unique observations in the data-set.
date_id act2 grp.bin pdist ship sea avgknots shore day location chid alt
4.dive 40707_004 TRUE 2 2.250 second light 14.06809 2.30805 12 Lower 4 dive
4.fly 40707_004 FALSE 2 2.250 second light 14.06809 2.30805 12 Lower 4 fly
4.none 40707_004 FALSE 2 2.250 second light 14.06809 2.30805 12 Lower 4 none
5.dive 40707_006 FALSE 2 0.000 second light 15.12650 2.53312 12 Lower 5 dive
5.fly 40707_006 TRUE 2 0.000 second light 15.12650 2.53312 12 Lower 5 fly
5.none 40707_006 FALSE 2 0.000 second light 15.12650 2.53312 12 Lower 5 none
6.dive 40707_007 FALSE 1 1.995 second light 14.02101 2.01680 12 Lower 6 dive
6.fly 40707_007 TRUE 1 1.995 second light 14.02101 2.01680 12 Lower 6 fly
6.none 40707_007 FALSE 1 1.995 second light 14.02101 2.01680 12 Lower 6 none
'act2' is the dependent variable and consists of choices a bird floating on the water could make when approached by a ship; fly, dive, or none. I am interested in how these probabilities relate to the remaining independent variables in the data.frame, i.e. perpendicular distance to the ship path (pdist) sea conditions (sea), speed (avgknots), distance to shore (shore) etc. The independent variables are made of dichotomous, factor and continuous variables.
I ran two multinomial logit models, one including all the choice options and another including only a subset. I then compared these models with the hmftest() function to test the IIA assumption. The results were confusing, to say the least. I will include the code for the two models and the test output (in case I am misspecifying the models in the code).
# model including all choice options (fly, dive, none)
mod.1 <- mlogit(act2 ~ 1 | pdist + as.factor(grp.bin) +
as.factor(sea) + avgknots + shore + as.factor(location),long_perp ,
reflevel = 'none')
# model including only a subset of choice options (fly, dive)
mod.alt <- mlogit(act2 ~ 1 | pdist + as.factor(grp.bin) +
as.factor(sea) + avgknots + shore + as.factor(location),long_perp ,
reflevel = 'none', alt.subset = c("fly","dive"))
# IIA test
hmftest(mod.1, mod.alt)
# output
Hausman-McFadden test
data: long_perp
chisq = -968.7303, df = 7, p-value = 1
alternative hypothesis: IIA is rejected
As you can see the chisquare statistic is negative! I assume I am either 1. doing something wrong, or 2. IIA is violated. This result holds true for choice subset (fly, dive), but the IIA assumption is upheld with choice subset (none, dive)? This confuses me.
Next I tried to formulate a nested model as a way to relax the IIA assumption. I nested the choices as nest1 = none, nest2 = fly, dive. This makes sense to me as this seems like a logical break, the bird decides to react or not then decides which reaction to make.
I am unclear on how to run the nested logit models (even after reading the two vignettes for mlogit, Croissant vignette and Train vignette).
When I run my analysis following the example in the Croissant vignette I get the following error.
nested.1 <- mlogit(act2 ~ 0 | pdist + as.factor(grp.bin) + as.factor(ship) +
as.factor(sea) + avgknots + shore + as.factor(location),
long_perp , reflevel="none",nests = list(noact = "none",
react = c("dive","fly")), unscaled = TRUE)
# Error in solve.default(crossprod(attr(x, "gradi")[, !fixed])) :
Lapack routine dgesv: system is exactly singular: U[19,19] = 0
I have read a bit about this error message and it may occur because of complete separation. I have looked at some tables of the data and do not believe this is happening as I have 4,000+ observations and only one factor variable with more than 2 levels (it has 3).
Help on these specific problems is greatly appreciated but I am also open to alternate analyses that I can use to answer my question. I am mainly interested in the probability of flying as a function of perpendicular distance to the ships path.
Thanks, Tim
To get a positive chi-square statistic, change the code so that the reference level is included in the subset as well:
alt.subset = c("none", "fly")
This may help, though the p-value may not change much.
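On why the statistic in the question can be negative at all: the Hausman-McFadden statistic is a quadratic form in the difference between the subset and full-model coefficients, weighted by the inverse of the difference of their covariance matrices. In finite samples that difference need not be positive definite, so the statistic can fall below zero. A toy sketch with made-up numbers (not taken from the bird data), where hm_stat is a hypothetical helper:

```r
# Quadratic form d' (V_sub - V_full)^{-1} d, with d = b_sub - b_full
hm_stat <- function(b_full, b_sub, V_full, V_sub) {
  d <- b_sub - b_full
  as.numeric(t(d) %*% solve(V_sub - V_full) %*% d)
}

b_full <- c(0.50, 1.00)   # made-up coefficients, full choice set
b_sub  <- c(0.60, 0.80)   # made-up coefficients, subset of alternatives
V_full <- diag(c(0.010, 0.020))

# Case 1: subset variances are larger, so the difference is positive definite
stat_ok <- hm_stat(b_full, b_sub, V_full, V_sub = diag(c(0.030, 0.050)))

# Case 2: one subset variance is smaller, so the difference is not positive definite
stat_bad <- hm_stat(b_full, b_sub, V_full, V_sub = diag(c(0.005, 0.050)))

print(c(positive_definite = stat_ok, not_positive_definite = stat_bad))
```

A negative value is therefore a symptom of the variance difference being ill-behaved rather than a meaningful chi-square draw.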

Resources