This is possibly a stupid question, but I was told to do a Redundancy Analysis in R (using the vegan package) to test for differences between groups in my data. However, I only have one dataset (roughly comparable to the iris dataset: https://en.wikipedia.org/wiki/Iris_flower_data_set), and everything I have found on RDA seems to need two matching sets. Did I mishear or misunderstand, or is there something else going on here?
As far as the underlying statistics are concerned, you do have two data matrices:
the four morphological variables in the iris data set, and
a single categorical predictor variable (the constraint), Species.
Using vegan's rda() with the iris example data, you'd do:
library("vegan")
iris.d <- iris[, 1:4]
ord <- rda(iris.d ~ Species, data = iris)
ord
set.seed(1)
anova(ord)
The permutation test tests for differences between species:
> anova(ord)
Permutation test for rda under reduced model
Permutation: free
Number of permutations: 999
Model: rda(formula = iris.d ~ Species, data = iris)
Df Variance F Pr(>F)
Model 2 3.9736 487.33 0.001 ***
Residual 147 0.5993
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
You might also look at adonis() (superseded by adonis2() in recent versions of vegan), which should do the same thing here as RDA but from a different viewpoint:
> adonis(iris.d ~ Species, data = iris)
Call:
adonis(formula = iris.d ~ Species, data = iris)
Permutation: free
Number of permutations: 999
Terms added sequentially (first to last)
Df SumsOfSqs MeanSqs F.Model R2 Pr(>F)
Species 2 2.31730 1.15865 532.74 0.87876 0.001 ***
Residuals 147 0.31971 0.00217 0.12124
Total 149 2.63701 1.00000
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(For some reason that is a lot slower...)
Also see betadisper(): these methods can detect a difference in means (centroids) that is due, at least in part, to differences in variance (dispersion).
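For completeness, a minimal betadisper() sketch for the iris example (assuming Euclidean distances on the four raw morphological variables) could look like:

```r
library("vegan")
## distances between observations in morphological space
d <- dist(iris[, 1:4])            # Euclidean by default
## dispersion of each species around its multivariate centroid
bd <- betadisper(d, group = iris$Species)
set.seed(1)
permutest(bd)                     # permutation test for equal dispersions
```

If this test is significant, group differences found by rda() or adonis() may partly reflect differences in spread rather than in location.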
I have a question regarding plotting and interpreting the relationship between continuous environmental variables to an NMDS ordination of species abundances using R.
This is a programming follow-up question to a question on Cross-Validated (stats.stackexchange).
In R's vegan package, function envfit() conveniently provides estimated coefficients for drafting a vector in 2-dimensional ordination space. This is convenient because we get a quantitative way to compare a given environmental variable to both axes of the ordination separately.
However, this really only works if the environmental variables have a linear relationship with the axes -- something again discussed in this CV post. When the variables have a non-linear relationship to each ordination axis, another vegan function, ordisurf() can be used.
But instead of providing a direct quantitative relationship between the variable and each axis, ordisurf() simply provides summary output for the overall GAM fit.
Question: Is there a way to extract and/or translate the summary.ordisurf output into the context of individual estimated coefficients for each ordination axis separately?
vegan co-author Gavin Simpson encouraged me to ask here on SO for an answer.
Below is example code (mostly borrowed from Gavin). ordisurf code:
require("vegan")
data(dune)
data(dune.env)
## fit NMDS using Bray-Curtis dissimilarity (default)
set.seed(12)
sol <- metaMDS(dune)
## NMDS plot
plot(sol)
## Fit and add the 2d surface
sol.s <- ordisurf(sol ~ A1, data = dune.env, method = "REML",
                  select = TRUE)
## look at the fitted model
summary(sol.s)
This produces:
> summary(sol.s)
Family: gaussian
Link function: identity
Formula:
y ~ s(x1, x2, k = knots[1], bs = bs[1])
<environment: 0x2fb78a0>
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.8500 0.4105 11.81 9.65e-10 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Approximate significance of smooth terms:
edf Ref.df F p-value
s(x1,x2) 1.591 9 0.863 0.0203 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
R-sq.(adj) = 0.29 Deviance explained = 35%
REML score = 41.587 Scale est. = 3.3706 n = 20
whereas envfit() would produce something separating the relationships with both axes (i.e., "NMDS1" and "NMDS2"):
envfit(sol ~ A1, data = dune.env)
***VECTORS
NMDS1 NMDS2 r2 Pr(>r)
A1 0.96473 0.26323 0.3649 0.02 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Permutation: free
Number of permutations: 999
I am a complete beginner in R/R Studio, coding and statistics in general.
In R, I am running a GLM where my Y variable is a no/yes (0/1) category and my X variable is a Sex category (female/male).
So I have run the following script:
hello <- read.csv(file.choose())
hello$sexbin <- ifelse(hello$Sex == 'm',0,ifelse(hello$Sex == 'f',1,NA))
modifhello <- subset(hello,hello$Combi_cag_long>=36)
model1 <- glm(modifhello$VAB ~ modifhello$Sex, family = binomial(link = logit),
              na.action = na.exclude, data = modifhello)
summary.lm(model1)
However, in my output, R seems to have split male/female into two separate terms, suggesting that it is not treating Sex as a proper binary variable:
Coefficients
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.689 1.009 -3.656 0.000258 ***
modifhello$Sexf 2.506 1.010 2.482 0.013084 *
modifhello$Sexm 2.922 1.010 2.894 0.003820 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
What do I need to add to my script to correct this?
FOUND THE SOLUTION
Simply use modifhello$VAB ~ modifhello$sexbin rather than modifhello$VAB ~ modifhello$Sex (the old, un-recoded column).
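Putting that together, a corrected version of the script (a sketch, keeping the column names from the question) would be:

```r
## recode Sex to numeric 0/1 and subset as before
hello$sexbin <- ifelse(hello$Sex == 'm', 0,
                ifelse(hello$Sex == 'f', 1, NA))
modifhello <- subset(hello, Combi_cag_long >= 36)

## use plain column names together with `data =`,
## and summary() rather than summary.lm() for a glm
model1 <- glm(VAB ~ sexbin, family = binomial(link = logit),
              na.action = na.exclude, data = modifhello)
summary(model1)
```

Note that passing Sex as a factor would also work; R's default dummy coding then gives one coefficient for the non-reference level.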
I have seven groups and about 600 traits, and I want to run an ANOVA for each trait to test whether the groups differ significantly.
I have already calculated the mean, standard deviation, and variance per group and per trait, and the seven groups have different sample sizes. How can I arrange my data so that I can run them all in R?
set.seed(2)
## 7 groups x 600 traits x 5 replicates per combination
sampledata <- expand.grid(group = paste0("group", 1:7),
                          trait = paste0("trait", 1:600),
                          rep = 1:5)
## fill in random "measurements"
sampledata$value <- rnorm(nrow(sampledata))
sampledata.aov <- aov(value ~ group * trait, data = sampledata)
anova(sampledata.aov)
Analysis of Variance Table
Response: value
Df Sum Sq Mean Sq F value Pr(>F)
group 6 7.1 1.1784 1.1670 0.32072
trait 599 658.0 1.0985 1.0878 0.07096 .
group:trait 3594 3613.0 1.0053 0.9955 0.56604
Residuals 16800 16964.3 1.0098
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
A warning though: even with random numbers, with this many traits you're more likely than not to see some "significant" differences by chance alone, so consider correcting for multiple comparisons.
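One common safeguard is to fit a separate one-way ANOVA per trait and then correct the 600 p-values for multiple testing, e.g. with p.adjust() (a sketch building on the simulated sampledata above):

```r
## one p-value per trait from a one-way ANOVA of value on group
pvals <- sapply(split(sampledata, sampledata$trait), function(d)
  anova(aov(value ~ group, data = d))[["Pr(>F)"]][1])
## Benjamini-Hochberg false discovery rate adjustment
padj <- p.adjust(pvals, method = "BH")
sum(padj < 0.05)  # traits still "significant" after adjustment
```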
I have a large multivariate abundance dataset and I am interested in comparing multiple models that fit different combinations of three categorical predictor variables to my species-matrix response variable. I have been using anova() to compare the different models, but I am having difficulty interpreting the output. Below, I have given my code as well as the corresponding R output.
library("mvabund")
invert.mvabund <- mvabund(mva.dat)
null<-manyglm(mva.dat~1, family='negative.binomial')
m1 <- manyglm(mva.dat~Habitat+Detritus, family='negative.binomial')
m2 <- manyglm(mva.dat~Habitat*Detritus, family='negative.binomial')
m3 <- manyglm(mva.dat~Habitat*Detritus+Block, family='negative.binomial')
anova(null,m1,m2,m3)
Analysis of Deviance Table
null: mva.dat ~ 1
m1: mva.dat ~ Habitat + Detritus
m2: mva.dat ~ Habitat * Detritus
m3: mva.dat ~ Habitat * Detritus + Block
Multivariate test:
Res.Df Df.diff Dev Pr(>Dev)
null 99
m1 94 5 257.2 0.001 ***
m2 90 4 87.7 0.003 **
m3 81 9 173.5 0.003 **
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
How do I interpret these results? Is m2 the best-fitting model because it has the lowest deviance, even though it has a higher p-value than m1? Is this because the p-value suggests there is a significant amount of deviance, so the optimal model will have a higher p-value? Any suggestions on how to interpret these results would be much appreciated; I haven't been able to find a clear answer in my Google searches. Thanks!
I have just discovered reshaping in R and am unsure of how to proceed with an ANOVA once the data is reshaped. I found this site which has the data organized in a way very similar to my own data. If I were using this hypothetical data, how would I conduct a 3-way ANOVA say between race, program and subject? Now that the subjects have been reshaped into a single column I'm having trouble seeing how to include this variable using the typical ANOVA code. Any help would be much appreciated!
Assuming the data are in 'long format' and score is your dependent variable, you could do something like:
mymodel <- aov(score ~ prog + race + subj, data = l)
summary(mymodel)
Which in this case yields:
Df Sum Sq Mean Sq F value Pr(>F)
prog 1 2864 2864 31.32 2.82e-08 ***
race 1 5064 5064 55.39 2.14e-13 ***
subj 4 106 27 0.29 0.885
Residuals 993 90780 91
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
N.B. this model contains only the main effects.
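If you want the full factorial model, i.e. the main effects plus all two- and three-way interactions, you could extend the formula with `*` (a sketch using the same hypothetical long-format data frame l):

```r
## full factorial 3-way ANOVA: main effects and all interactions
fullmodel <- aov(score ~ prog * race * subj, data = l)
summary(fullmodel)

## equivalent explicit form:
## aov(score ~ prog + race + subj + prog:race + prog:subj +
##             race:subj + prog:race:subj, data = l)
```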