How to interpret CCA vegan output in R

I have performed a canonical correspondence analysis in R using the vegan package, but I find the output very difficult to understand. The triplot is understandable, but all the numbers I get from summary(cca) are confusing to me (I've just started to learn about ordination techniques).
I would like to know how much of the variance in Y is explained by X (in this case, the environmental variables), and which of the independent variables are important in this model?
My output looks like this:
Partitioning of mean squared contingency coefficient:
Inertia Proportion
Total 4.151 1.0000
Constrained 1.705 0.4109
Unconstrained 2.445 0.5891
Eigenvalues, and their contribution to the mean squared contingency coefficient
Importance of components:
CCA1 CCA2 CCA3 CCA4 CCA5 CCA6 CCA7
Eigenvalue 0.6587 0.4680 0.34881 0.17690 0.03021 0.02257 0.0002014
Proportion Explained 0.1587 0.1127 0.08404 0.04262 0.00728 0.00544 0.0000500
Cumulative Proportion 0.1587 0.2714 0.35548 0.39810 0.40538 0.41081 0.4108600
CA1 CA2 CA3 CA4 CA5 CA6 CA7
Eigenvalue 0.7434 0.6008 0.36668 0.33403 0.28447 0.09554 0.02041
Proportion Explained 0.1791 0.1447 0.08834 0.08047 0.06853 0.02302 0.00492
Cumulative Proportion 0.5900 0.7347 0.82306 0.90353 0.97206 0.99508 1.00000
Accumulated constrained eigenvalues
Importance of components:
CCA1 CCA2 CCA3 CCA4 CCA5 CCA6 CCA7
Eigenvalue 0.6587 0.4680 0.3488 0.1769 0.03021 0.02257 0.0002014
Proportion Explained 0.3863 0.2744 0.2045 0.1037 0.01772 0.01323 0.0001200
Cumulative Proportion 0.3863 0.6607 0.8652 0.9689 0.98665 0.99988 1.0000000
Scaling 2 for species and site scores
* Species are scaled proportional to eigenvalues
* Sites are unscaled: weighted dispersion equal on all dimensions
Species scores
CCA1 CCA2 CCA3 CCA4 CCA5 CCA6
S.marinoi -0.3890 0.39759 0.1080 -0.005704 -0.005372 -0.0002441
C.tripos 1.8428 0.23999 -0.1661 -1.337082 0.636225 -0.5204045
P.alata 1.6892 0.17910 -0.3119 0.997590 0.142028 0.0601177
P.seriata 1.4365 -0.15112 -0.8646 0.915351 -1.455675 -1.4054078
D.confervacea 0.2098 -1.23522 0.5317 -0.089496 -0.034250 0.0278820
C.decipiens 2.2896 0.65801 -1.0315 -1.246933 -0.428691 0.3649382
P.farcimen -1.2897 -1.19148 -2.3562 0.032558 0.104148 -0.0068910
C.furca 1.4439 -0.02836 -0.9459 0.301348 -0.975261 0.4861669
Biplot scores for constraining variables
CCA1 CCA2 CCA3 CCA4 CCA5 CCA6
Temperature 0.88651 0.1043 -0.07283 -0.30912 -0.22541 0.24771
Salinity 0.32228 -0.3490 0.30471 0.05140 -0.32600 0.44408
O2 -0.81650 0.4665 -0.07151 0.03457 0.20399 -0.20298
Phosphate 0.22667 -0.8415 0.41741 -0.17725 -0.06941 -0.06605
TotP -0.33506 -0.6371 0.38858 -0.05094 -0.24700 -0.25107
Nitrate 0.15520 -0.3674 0.38238 -0.07154 -0.41349 -0.56582
TotN -0.23253 -0.3958 0.16550 -0.25979 -0.39029 -0.68259
Silica 0.04449 -0.8382 0.15934 -0.22951 -0.35540 -0.25650
Which of all these numbers are important to my analysis?
/anna

How much variation is explained by X?
In CCA, "variance" isn't variance in the usual sense; we express it as the "mean squared contingency coefficient", or "inertia". All the information you need to work out how much "variation" in Y is explained by X is contained in the section of the output that I reproduce below:
Partitioning of mean squared contingency coefficient:
Inertia Proportion
Total 4.151 1.0000
Constrained 1.705 0.4109
Unconstrained 2.445 0.5891
In this example the total inertia is 4.151, and your X variables (the "Constraints") explain 1.705 of that inertia, which is about 41%, leaving about 59% unexplained.
The next section, on the eigenvalues, lets you see, both in terms of inertia explained and proportion explained, which axes contribute most to the explanatory "power" of the CCA (the Constrained part of the table above) and which to the unexplained "variance" (the Unconstrained part of the table above).
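The proportions in that table are just each inertia component divided by the total. A quick check, sketched here in Python with the values copied from the output above (the printed inertias are rounded, so the last digit can differ slightly):

```python
# Reproduce the "Proportion" column of the inertia partitioning table.
total = 4.151          # total inertia (mean squared contingency coefficient)
constrained = 1.705    # inertia explained by the environmental variables
unconstrained = 2.445  # residual (unexplained) inertia

print(round(constrained / total, 4))    # proportion explained by X, ~0.41
print(round(unconstrained / total, 4))  # proportion unexplained, ~0.59
```

The two proportions sum to 1 because constrained and unconstrained inertia partition the total.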
The next section contains the ordination scores. Think of these as the coordinates of the points in the triplot. For some reason you don't show the site scores in the output above, but they would normally be there. Note that these have been scaled (by default this uses scaling = 2), so site points are at their weighted average of the species scores, IIRC.
The "Biplot" scores are the locations of the arrow heads or the labels on the arrows - I forget exactly how the plot is drawn now.
Which of all these numbers are important to my analysis?
All of them are important: if you think the triplot is important and interpretable, it is based entirely on the information reported by summary(). If you have specific questions to ask of the data, then perhaps only certain sections will be of paramount importance to you.
However, Stack Overflow is not the place to ask such questions of a statistical nature.

I don't have the ability to comment, but in response to the first answer's interpretation of the species and site scores in scaling 2, I believe their explanation is backwards.
In the book "Numerical Ecology with R", Borcard, Gillet, and Legendre clearly state that in scaling 2 the species scores are weighted averages of the site scores.
This can be confirmed by using the ordihull function on a CCA object.
Also, the output from the OP states that species scores are scaled and site scores are unscaled, which I believe confirms what the book says:
"Scaling 2 for species and site scores
* Species are scaled proportional to eigenvalues
* Sites are unscaled: weighted dispersion equal on all dimensions"

Related

R CCA - Can species scores be related to the CCA axes & how does the biplot arrow length relate to the significance of variables?

Hello, this is my first question on Stack Overflow or any similar forum, so please excuse me and be kind if I missed something out ;)
I am using the vegan package in R to compute a CCA. Because my study is about intraspecific variation of species traits, I do not have a "plot x species" matrix but an "individual x trait" matrix representing a "physio-chemical niche" (so my species scores look different than they usually would).
So my questions are:
Is it appropriate to do the analysis this way?
Is it possible to interpret the CCA axes based on the "species scores" (which are not species scores in my case)? I would like to have information like: CCA1 is most related to trait X.
How can I interpret the length of the biplot arrows in comparison to the permutation test (anova.cca)? I get many "long" arrows, but looking at the permutation test only few of them are significant.
Here is my summary(cca)-Output:
Call:
cca(formula = mniche_g ~ cover_total * Richness + altitude + Eastness + lan_TEMP + lan_REACT + lan_NUTRI + lan_MOIST + Condition(glacier/transect/plot/individuum), data = mres_g_sc)
Partitioning of scaled Chi-square:
Inertia Proportion
Total 0.031551 1.00000
Conditioned 0.001716 0.05439
Constrained 0.006907 0.21890
Unconstrained 0.022928 0.72670
Eigenvalues, and their contribution to the scaled Chi-square
after removing the contribution of conditioning variables
Importance of components:
CCA1 CCA2 CCA3 CA1 CA2 CA3
Eigenvalue 0.00605 0.0005713 0.0002848 0.0167 0.00382 0.002413
Proportion Explained 0.20280 0.0191480 0.0095474 0.5596 0.12805 0.080863
Cumulative Proportion 0.20280 0.2219458 0.2314932 0.7911 0.91914 1.000000
Accumulated constrained eigenvalues
Importance of components:
CCA1 CCA2 CCA3
Eigenvalue 0.00605 0.0005713 0.0002848
Proportion Explained 0.87604 0.0827150 0.0412425
Cumulative Proportion 0.87604 0.9587575 1.0000000
Scaling 2 for species and site scores
* Species are scaled proportional to eigenvalues
* Sites are unscaled: weighted dispersion equal on all dimensions
Species scores
CCA1 CCA2 CCA3 CA1 CA2 CA3
SLA_range_ind 0.43964 -0.002623 -0.0286814 -0.75599 -0.04823 0.003317
SLA_mean_ind 0.01771 -0.042969 0.0246679 -0.01180 0.12732 0.053094
LNC -0.10613 -0.064207 -0.0637272 0.07261 -0.15962 0.198612
LCC -0.01375 0.012131 -0.0005462 0.02573 -0.01539 -0.021314
...
Here is my anova.cca(cca)-Output:
Permutation test for cca under reduced model
Terms added sequentially (first to last)
Permutation: free
Number of permutations: 999
Model: cca(formula = mniche_g ~ cover_total * Richness + altitude + Eastness + lan_TEMP + lan_REACT + lan_NUTRI + lan_MOIST + Condition(glacier/transect/plot/individuum), data = mres_g_sc)
Df ChiSquare F Pr(>F)
cover_total 1 0.0023710 10.4442 0.002 **
Richness 1 0.0006053 2.6663 0.080 .
altitude 1 0.0022628 9.9676 0.001 ***
Eastness 1 0.0005370 2.3657 0.083 .
lan_TEMP 1 0.0001702 0.7497 0.450
lan_REACT 1 0.0005519 2.4313 0.094 .
lan_NUTRI 1 0.0000883 0.3889 0.683
lan_MOIST 1 0.0001017 0.4479 0.633
cover_total:Richness 1 0.0002184 0.9620 0.351
Residual 101 0.0229283
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
and here the biplot:
[biplot image]
Thank you all!
I don't have sufficient information to say whether it is sensible to use CCA for your data. I'm suspicious, and I think it may not be sensible. The critical question is: does the sum of the traits make sense? If it does not, *CA makes no sense, because you need both row and column sums there, and if you have measured your variables in different units their sum makes no sense. For instance, if you change the units of one "trait", say from inches to centimetres, the results will change. It is probably wiser to use RDA/PCA with equalizing scaling of variables.
You can get the relationship of a single variable to an axis from the analysis: it is just the ordination score of that variable. Visually you see it by projecting the point onto the axis, numerically with summary() or scores(). However, I think you should not want that, but I won't stop you if you do something that I think you should not do. (Interpretation of rotated dimensions may be more meaningful; axes are just a frame of reference to draw plots.)
Brief answer: the arrow lengths have no relation to the so-called significances. Longer answer: the scaling of biplot arrow lengths depends on the scaling of your solution and on the number of constrained axes in your solution. The biplot scores are based on the relationship with the so-called linear combination (LC) scores, which are completely defined by these very same variables, and the multiple correlation of a constraining variable with all constrained axes is 1. In the default scaling ("species"), all your biplot arrows have unit length in the full constrained solution; but if you show only two of several axes, arrows appear short if they are long in the dimensions you do not show, and long if the dimensions you show are the only ones that matter for those variables. With other scalings you also add scaling by axis eigenvalues. However, these lengths have nothing to do with the so-called significances of the variables. (By the way, you used a sequential method in your significance tests, which means the testing order influences the results. That is completely OK, but it differs from interpreting arrows, which are not order-dependent.)

Canonical Correspondence Analysis (CCA) using R

I use R for my data analysis, and I get a discrepancy when I use the vegan package to calculate and plot a CCA (Canonical Correspondence Analysis): one of the axes seems to be lost and the eigenvalues differ from those given by PAST.
These are the eigenvalues when I use R:
Eigenvalues for constrained axes:
CCA1 CCA2 CCA3 CCA4 CCA5
0.18496 0.02405 0.01492 0.01103 0.00260
And these are the eigenvalues when I use PAST:
Axis Eigenvalue %
1 0.11343 74.19
2 0.023363 15.28
3 0.011034 7.217
4 0.0050609 3.31
5 1.2233E-10 8.002E-08
I don't know where my mistake is. I need some help solving this problem.

cca using vegan : no unconstrained inertia

I ran a CCA using vegan with 7500 sites, 9 species, and 5 constraining variables. The results are:
Call: cca(sitspe = Yp, sitenv = Xp)
Inertia Proportion Rank
Total
Constrained 0.5051 6
Inertia is mean squared contingency coefficient
Eigenvalues for constrained axes:
[1] 0.3317 0.1301 0.0328 0.0089 0.0014 0.0003
I don't understand why there is no unconstrained or total inertia?
Probably your constrained axes explained everything, and no unconstrained inertia was left. How many axes did you get with an unconstrained ordination (CA, i.e. CCA without constraints)?
Your data matrix is really non-square: its dimensions are 7500 by 9. There are only nine species, and if these are dependent or otherwise redundant, you may be able to explain everything with your constraints.
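The rank argument above can be checked numerically: a matrix with only nine columns has rank at most 9, however many sites there are, so a few constraints plus redundancies among the species can exhaust all the inertia. A minimal sketch in Python with NumPy (random data standing in for the real community matrix):

```python
import numpy as np

# Community matrix with 7500 sites (rows) but only 9 species (columns).
rng = np.random.default_rng(2)
Y = rng.random((7500, 9))

# The rank is bounded by the smaller dimension, here the 9 species.
print(np.linalg.matrix_rank(Y))  # at most 9
```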

PCA with the psych R package: how to obtain only the total % explained variance and a model fit measure?

In a Shiny app I am building, I want to show only the explained variance and the model-fit measure from the output of the principal function (of the R package psych). I investigated the structure of the output, but unfortunately (and perhaps a bit strangely) I couldn't find the exact spot of these values. Does anyone have an idea how to obtain these values from the output?
First, if you expect help, you should provide a reproducible example, which includes a sample of your data. This is why your question was downvoted (not by me though).
The variance due to the ith principal component is given by the ith eigenvalue of the correlation matrix. Since the PCs are orthogonal (uncorrelated) by definition, the total variance is given by the sum of the individual variances = the sum of the eigenvalues. The eigenvalues are returned in principal(...)$values. So the proportion of total variance explained by each PC is given by:
prop.table(principal(...)$values)
Since you didn't provide any data, I'll use the built-in mtcars dataset for a working example:
library(psych)
df <- mtcars[c("hp","mpg","disp","wt","qsec")]
pca <- principal(df)
prop.table(pca$values)
# [1] 0.73936484 0.19220335 0.03090626 0.02623083 0.01129473
So the first PC explains 74% of the total variation, the second PC explains 19%, etc. This agrees perfectly with the result using prcomp(...), keeping in mind that principal(...) scales by default, while prcomp(...) does not.
pc <- prcomp(df,scale.=T)
summary(pc)
# Importance of components:
# PC1 PC2 PC3 PC4 PC5
# Standard deviation 1.9227 0.9803 0.39310 0.36215 0.23764
# Proportion of Variance 0.7394 0.1922 0.03091 0.02623 0.01129
# Cumulative Proportion 0.7394 0.9316 0.96247 0.98871 1.00000
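The same proportions can be recovered by hand from the eigenvalues of the correlation matrix, which is all prop.table(principal(...)$values) does. A sketch in Python with NumPy, using synthetic data as a stand-in for the five mtcars columns:

```python
import numpy as np

# Synthetic data: 32 observations of 5 variables (shape of the mtcars subset).
rng = np.random.default_rng(0)
X = rng.normal(size=(32, 5))

# principal() scales by default, so work with the correlation matrix.
R = np.corrcoef(X, rowvar=False)

# Eigenvalues, largest first; each one is the variance of one PC.
eigvals = np.sort(np.linalg.eigvalsh(R))[::-1]

# Proportion of total variance per PC; the trace of R equals the
# number of variables, so the proportions always sum to 1.
prop = eigvals / eigvals.sum()
print(prop)
```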
The parameter "Fit based upon off diagonal values" is given in principal(...)$fit.off, as explained in the documentation.

Scaling of covariance matrices

For the question "Ellipse around the data in MATLAB", in the answer given by Amro, he says the following:
"If you want the ellipse to represent
a specific level of standard
deviation, the correct way of doing is
by scaling the covariance matrix"
and the code to scale it was given as
STD = 2; %# 2 standard deviations
conf = 2*normcdf(STD)-1; %# covers around 95% of population
scale = chi2inv(conf,2); %# inverse chi-squared with dof=#dimensions
Cov = cov(X0) * scale;
[V D] = eig(Cov);
I don't understand the first 3 lines of the above code snippet. How is the scale calculated by chi2inv(conf,2), and what is the rationale behind multiplying it with the covariance matrix?
Additional Question:
I also found that if I scale it with 1.5 STD, i.e. the 86th percentile, the ellipse can cover all of the points when my point set is clumped together, in almost all cases. On the other hand, if I scale it with 3 STD, i.e. the 99th percentile, the ellipse is far too big. How can I choose an STD that just tightly covers the clumped points?
Here is an example:
The inner ellipse corresponds to 1.5 STD and the outer to 2.5 STD. Why does 1.5 STD tightly cover the clumped white points? Is there an approach or reason to define it?
The objective of displaying an ellipse around the data points is to show the confidence interval, or in other words, "how much of the data is within a certain standard deviation way from the mean"
In the above code, he has chosen to display an ellipse that covers 95% of the data points. For a normal distribution, ~68% of the data is within 1 s.d. of the mean, ~95% within 2 s.d. and ~99.7% within 3 s.d. (you can verify this by calculating the area under the curve). Hence the value STD = 2; you'll find that conf is approximately 0.95.
The distance of the data points from the centroid of the data goes something like (xi^2+yi^2)^0.5, ignoring coefficients. Sums of squares of random variables follow a chi-square distribution and hence to get the corresponding 95 percentile, he uses the inverse chi-square function, with d.o.f. 2, as there are two variables.
Lastly, the rationale behind multiplying the scaling constant follows from the fact that for a square matrix A with eigenvalues a1,...,an, the eigenvalues of a matrix kA, where k is a scalar is simply ka1,...,kan. The eigenvalues give the corresponding lengths of the major/minor axis of the ellipse, and so scaling the ellipse or the eigenvalues to the 95%tile is equivalent to multiplying the covariance matrix with the scaling factor.
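That last fact, that the eigenvalues of kA are simply k times those of A, is easy to verify numerically. A minimal sketch in Python with NumPy, using a random symmetric matrix as a stand-in for the covariance matrix:

```python
import numpy as np

# A random symmetric matrix, playing the role of a covariance matrix.
rng = np.random.default_rng(1)
A = rng.normal(size=(3, 3))
A = (A + A.T) / 2

k = 6.18  # the chi-square scale factor from the snippet above

# Eigenvalues of k*A are exactly k times the eigenvalues of A,
# which is why scaling the covariance matrix scales the ellipse axes.
eig_A = np.linalg.eigvalsh(A)
eig_kA = np.linalg.eigvalsh(k * A)
print(np.allclose(eig_kA, k * eig_A))  # True
```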
EDIT
Cheng, although you might already know this, I suggest that you also read this answer to a question on randomness. Consider a Gaussian random variable with zero mean, unit variance. The PDF of a collection of such random variables looks like this
Now, if I were to take two such collections of random variables, square them separately and add them to form a single collection of a new random variable, its distribution looks like this
This is the chi-square distribution with 2 degrees of freedom (since we added two collections).
The equation of the ellipse in the above code can be written as x^2/a^2 +y^2/b^2=k, where x,y are the two random variables, a and b are the major/minor axes, and k is some scaling constant that we need to figure out. As you can see, the above can be interpreted as squaring and adding two collections of Gaussian random variables, and we just saw above what its distribution looks like. So, we can say that k is a random variable that is chi-square distributed with 2 degrees of freedom.
Now all that needs to be done is to find a value of k such that 95% of the data is within it. Just like the 1 s.d., 2 s.d., 3 s.d. percentiles that we're familiar with for Gaussians, the corresponding percentile for chi-square with 2 degrees of freedom (conf ≈ 0.9545 here) is around 6.18. This is what Amro obtains from the chi2inv function. He could almost as well have written scale = chi2inv(0.95, 2), which gives the slightly smaller value 5.99. It's just that talking in terms of n s.d. away from the mean is intuitive.
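For two dimensions both steps have closed forms (2*normcdf(x)-1 is the error function erf(x/√2), and chi2inv(p, 2) = -2·ln(1 - p)), so the MATLAB snippet can be checked with the Python standard library alone:

```python
import math

STD = 2                                 # 2 standard deviations
conf = math.erf(STD / math.sqrt(2))     # = 2*normcdf(STD) - 1, ~0.9545
scale = -2 * math.log(1 - conf)         # chi2inv(conf, 2) in closed form
print(round(conf, 4), round(scale, 2))  # ~0.9545 and ~6.18
```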
Just to illustrate, here's a PDF of the chi-square distribution above, with 95% of the area < some x shaded in red. This x is ~6.18.
Hope this helped.