Canonical Correspondence Analysis (CCA) using R

I use R for my data analysis, and I ran into a problem when using the vegan package to calculate and plot a CCA (Canonical Correspondence Analysis). One of the variables was lost and the eigenvalues are different.
These are the eigenvalues when I use R:
Eigenvalues for constrained axes:
CCA1 CCA2 CCA3 CCA4 CCA5
0.18496 0.02405 0.01492 0.01103 0.00260
And these are the eigenvalues when I use PAST:
Axis Eigenvalue %
1 0.11343 74.19
2 0.023363 15.28
3 0.011034 7.217
4 0.0050609 3.31
5 1.2233E-10 8.002E-08
I don't know what I did wrong. How can I resolve this?

Related

R Studio CCA of community indices and landscape metrics

I'm currently trying to perform a CCA in RStudio with a range of environmental and biodiversity variables. However, while I encounter no coding errors, the result doesn't seem to be correct.
I'm no whiz when it comes to the fundamentals of stats, if I'm being honest, so I was hoping someone could explain the issue I'm facing here.
Here is the code.
Arthropod.cca <- cca(Biodiversity ~ D1+MPI+SDI+SEI+AWMPFD, data=Environment)
Arthropod.cca
plot(Arthropod.cca)
summary(Arthropod.cca)
Biodiversity is the name of my community structure dataset. It includes values for Menhinick, Shannon's indices and also Hill's ratio for evenness.
D1+MPI+SDI+SEI+AWMPFD are my environmental variables looking at landscape diversity, evenness, fragmentation etc.
However, R just gives me this back.
Inertia Proportion Rank
Total 0.008392 1.000000
Constrained 0.008392 1.000000 2
Unconstrained 0.000000 0.000000 0
Inertia is scaled Chi-square
Some constraints or conditions were aliased because they were redundant
Eigenvalues for constrained axes:
CCA1 CCA2
0.008388 0.000004
I originally had 5 constraining variables; however, this has been reduced to only 2 axes, with wildly different eigenvalues. Overall, I'm extremely confused about this.
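One possible reading of this output, offered as a sketch rather than a diagnosis: a correspondence analysis of a table with p response columns has at most p - 1 axes, so three indices can yield at most two constrained axes no matter how many constraints are supplied. The example below uses vegan's bundled `varespec` and `varechem` data as stand-ins; the column choices and the six-site subset are assumptions made for illustration, not the asker's data.

```r
library(vegan)
data(varespec, varechem)

# Stand-in for a small response table with only three columns
Y <- varespec[1:6, c("Cladstel", "Cladrang", "Pleuschr")]
# Five constraining variables, as in the question
X <- varechem[1:6, c("N", "P", "K", "Ca", "Mg")]

m <- cca(Y ~ N + P + K + Ca + Mg, data = X)
m

# The number of constrained axes is capped by ncol(Y) - 1
n_axes <- m$CCA$rank
n_axes
```

With few sites relative to the number of constraints, the constraints can also absorb all of the inertia, which would leave the unconstrained component empty.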

R CCA - Can species scores be related to CCA axes, and how does the biplot arrow length relate to the significance of variables?

Hello, this is my first question on Stack Overflow or any similar forum, so please excuse me and be kind if I missed something ;)
I am using the vegan package in R to run a CCA. Because my study is about intraspecific variation of species traits, I do not have a "plot × species" matrix but an "individual × trait" matrix representing a "physico-chemical niche" (so my species scores look different than usual).
So my questions are:
Is it appropriate to do the analysis this way?
Is it possible to interpret the CCA axes based on the "species scores" (which are not species scores in my case)? I would like to obtain information such as: CCA1 is most related to trait X.
How can I interpret the length of the biplot arrows in comparison to the permutation test (anova.cca)? I get many "long" arrows, but according to the permutation test only a few of them are significant.
Here is my summary(cca)-Output:
Call:
cca(formula = mniche_g ~ cover_total * Richness + altitude + Eastness + lan_TEMP + lan_REACT + lan_NUTRI + lan_MOIST + Condition(glacier/transect/plot/individuum), data = mres_g_sc)
Partitioning of scaled Chi-square:
Inertia Proportion
Total 0.031551 1.00000
Conditioned 0.001716 0.05439
Constrained 0.006907 0.21890
Unconstrained 0.022928 0.72670
Eigenvalues, and their contribution to the scaled Chi-square
after removing the contribution of conditiniong variables
Importance of components:
CCA1 CCA2 CCA3 CA1 CA2 CA3
Eigenvalue 0.00605 0.0005713 0.0002848 0.0167 0.00382 0.002413
Proportion Explained 0.20280 0.0191480 0.0095474 0.5596 0.12805 0.080863
Cumulative Proportion 0.20280 0.2219458 0.2314932 0.7911 0.91914 1.000000
Accumulated constrained eigenvalues
Importance of components:
CCA1 CCA2 CCA3
Eigenvalue 0.00605 0.0005713 0.0002848
Proportion Explained 0.87604 0.0827150 0.0412425
Cumulative Proportion 0.87604 0.9587575 1.0000000
Scaling 2 for species and site scores
* Species are scaled proportional to eigenvalues
* Sites are unscaled: weighted dispersion equal on all dimensions
Species scores
CCA1 CCA2 CCA3 CA1 CA2 CA3
SLA_range_ind 0.43964 -0.002623 -0.0286814 -0.75599 -0.04823 0.003317
SLA_mean_ind 0.01771 -0.042969 0.0246679 -0.01180 0.12732 0.053094
LNC -0.10613 -0.064207 -0.0637272 0.07261 -0.15962 0.198612
LCC -0.01375 0.012131 -0.0005462 0.02573 -0.01539 -0.021314
...
Here is my anova.cca(cca)-Output:
Permutation test for cca under reduced model
Terms added sequentially (first to last)
Permutation: free
Number of permutations: 999
Model: cca(formula = mniche_g ~ cover_total * Richness + altitude + Eastness + lan_TEMP + lan_REACT + lan_NUTRI + lan_MOIST + Condition(glacier/transect/plot/individuum), data = mres_g_sc)
Df ChiSquare F Pr(>F)
cover_total 1 0.0023710 10.4442 0.002 **
Richness 1 0.0006053 2.6663 0.080 .
altitude 1 0.0022628 9.9676 0.001 ***
Eastness 1 0.0005370 2.3657 0.083 .
lan_TEMP 1 0.0001702 0.7497 0.450
lan_REACT 1 0.0005519 2.4313 0.094 .
lan_NUTRI 1 0.0000883 0.3889 0.683
lan_MOIST 1 0.0001017 0.4479 0.633
cover_total:Richness 1 0.0002184 0.9620 0.351
Residual 101 0.0229283
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
And here is the biplot: [image not included]
Thank you all!
I don't have sufficient information to say if it is sensible to use CCA for your data. I'm suspicious and I think it may not be sensible. The critical question is "does the sum of traits make sense?". If it does not, *CA makes no sense, because you need both row and column sums there, and if you have measured your variables in different units, their sum makes no sense. For instance, if you change the units of one "trait", say, from inches to centimetres, the results will change. It is probably wiser to use RDA/PCA with equalizing scaling of variables.
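The unit dependence can be checked directly. The following is a minimal sketch that uses four soil variables from vegan's `varechem` data as a stand-in for a trait matrix (an assumption for illustration only) and multiplies one column by 2.54, as if converting inches to centimetres.

```r
library(vegan)
data(varechem)

# Four soil variables stand in for a trait matrix measured in mixed units
traits <- varechem[, c("N", "P", "K", "Ca")]
# Same data, but one column re-expressed in a different unit
traits_cm <- transform(traits, N = N * 2.54)

# Eigenvalues of CA before and after the unit change
ca_in <- as.numeric(eigenvals(cca(traits)))
ca_cm <- as.numeric(eigenvals(cca(traits_cm)))

# Eigenvalues of PCA on standardized variables before and after
pca_in <- as.numeric(eigenvals(rda(traits, scale = TRUE)))
pca_cm <- as.numeric(eigenvals(rda(traits_cm, scale = TRUE)))

isTRUE(all.equal(ca_in, ca_cm))    # are the CA eigenvalues unchanged?
isTRUE(all.equal(pca_in, pca_cm))  # are the PCA eigenvalues unchanged?
```

The CA eigenvalues shift because rescaling one column alters the row and column sums that CA relies on, whereas standardizing each variable (`scale = TRUE`) removes the unit before PCA.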
You can get the relationship of a single variable to an axis from the analysis. It is just the ordination score of that variable. Visually you see it by projecting the point onto the axis, numerically with summary or scores. However, I think you should not want to have that, but I won't stop you if you do something that I think you should not do. (Interpretation of rotated dimensions may be more meaningful; axes are just a framework of reference to draw plots.)
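As a small illustration of reading these scores numerically, the sketch below fits a CCA to vegan's bundled `dune` data (a stand-in chosen here, not the asker's trait matrix) and lists the column items with the largest absolute score on the first axis.

```r
library(vegan)
data(dune, dune.env)

m <- cca(dune ~ A1 + Moisture, data = dune.env)

# Scores of the column items ("species") on the first two constrained axes
sp <- scores(m, display = "species", choices = 1:2)

# Items with the largest absolute score on the first axis
top <- sp[order(abs(sp[, 1]), decreasing = TRUE), , drop = FALSE]
head(top, 5)
```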
Brief answer: the arrow lengths have no relation to the so-called significances. Longer answer: The scaling of biplot arrow lengths depends on the scaling of your solution and the number of constrained axes in your solution. The biplot scores are based on the relationship with the so-called Linear Combination scores – which are completely defined by these very same variables, and the multiple correlation of the constraining variable with all constrained axes is 1. In the default scaling ("species"), all your biplot arrows have unit length in the full constrained solution, but if you show only two of several axes, the arrows appear shorter if they are long in the dimensions that you do not show, and they appear long, if the dimensions you show are the only ones that are important for these variables. With other scalings, you also add scaling by axis eigenvalues. However, these lengths have nothing to do with so-called significances of these variables. (BTW, you used sequential method in your significance tests which means that the testing order will influence the results. This is completely OK, but different from interpreting arrows which are not order-dependent.)
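The point about arrow lengths can be illustrated with a short sketch. It uses the `varespec ~ Al + P + K` model from vegan's bundled data (chosen here only as an example) and compares arrow lengths over all constrained axes with the lengths seen on the first two axes, then runs sequential and marginal permutation tests. The number of permutations is an arbitrary choice.

```r
library(vegan)
data(varespec, varechem)

m <- cca(varespec ~ Al + P + K, data = varechem)

# Biplot scores on all three constrained axes vs. only the first two
bp_full <- scores(m, display = "bp", choices = 1:3, scaling = "species")
bp_2d   <- scores(m, display = "bp", choices = 1:2, scaling = "species")

len_full <- sqrt(rowSums(bp_full^2))  # arrow lengths in the full solution
len_2d   <- sqrt(rowSums(bp_2d^2))    # arrow lengths as drawn on two axes
cbind(len_full, len_2d)

# Sequential tests depend on term order; marginal tests do not
anova(m, by = "terms", permutations = 199)
anova(m, by = "margin", permutations = 199)
```

Neither column of lengths is derived from the permutation results, which is why a long arrow and a small p-value need not coincide.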

cca using vegan : no unconstrained inertia

I ran a CCA using vegan with 7500 sites, 9 species and 5 constraining variables. The results are:
Call: cca(sitspe = Yp, sitenv = Xp)
Inertia Proportion Rank
Total
Constrained 0.5051 6
Inertia is mean squared contingency coefficient
Eigenvalues for constrained axes:
[1] 0.3317 0.1301 0.0328 0.0089 0.0014 0.0003
I don't understand why there is no unconstrained or total inertia?
Probably your constrained axes explained everything, and no unconstrained inertia was left. How many axes did you get with unconstrained ordination (CCA without constraints)?
Your data are really non-square: matrix dimensions are 7500 times 9. There are only nine species, and if these are dependent or otherwise redundant, you may be able to explain everything with your constraints.
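The comparison suggested above can be sketched as follows. The four-column subset of vegan's `varespec` data and the five `varechem` constraints are stand-ins assumed for illustration; they are not the asker's `Yp` and `Xp`.

```r
library(vegan)
data(varespec, varechem)

# Four response columns stand in for a narrow community table
Y <- varespec[, c("Cladstel", "Cladrang", "Cladarbu", "Pleuschr")]

m0 <- cca(Y)                                         # unconstrained CA
m1 <- cca(Y ~ N + P + K + Ca + Mg, data = varechem)  # constrained CCA

m0$CA$rank                   # unconstrained axes, at most ncol(Y) - 1
m1$CCA$rank                  # constrained axes
m1$CCA$tot.chi / m1$tot.chi  # share of total inertia taken by constraints
```

If the constrained model already has as many axes as the unconstrained one and the share is close to 1, there is nothing left for an unconstrained component.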

how to interpret cca vegan output

I have performed a canonical correspondence analysis in R using the vegan package, but I find the output very difficult to understand. The triplot is understandable, but all the numbers I get from summary(cca) are confusing to me (as I've just started to learn about ordination techniques).
I would like to know how much of the variance in Y is explained by X (in this case, the environmental variables) and which of the independent variables are important in this model.
My output looks like this:
Partitioning of mean squared contingency coefficient:
Inertia Proportion
Total 4.151 1.0000
Constrained 1.705 0.4109
Unconstrained 2.445 0.5891
Eigenvalues, and their contribution to the mean squared contingency coefficient
Importance of components:
CCA1 CCA2 CCA3 CCA4 CCA5 CCA6 CCA7
Eigenvalue 0.6587 0.4680 0.34881 0.17690 0.03021 0.02257 0.0002014
Proportion Explained 0.1587 0.1127 0.08404 0.04262 0.00728 0.00544 0.0000500
Cumulative Proportion 0.1587 0.2714 0.35548 0.39810 0.40538 0.41081 0.4108600
CA1 CA2 CA3 CA4 CA5 CA6 CA7
Eigenvalue 0.7434 0.6008 0.36668 0.33403 0.28447 0.09554 0.02041
Proportion Explained 0.1791 0.1447 0.08834 0.08047 0.06853 0.02302 0.00492
Cumulative Proportion 0.5900 0.7347 0.82306 0.90353 0.97206 0.99508 1.00000
Accumulated constrained eigenvalues
Importance of components:
CCA1 CCA2 CCA3 CCA4 CCA5 CCA6 CCA7
Eigenvalue 0.6587 0.4680 0.3488 0.1769 0.03021 0.02257 0.0002014
Proportion Explained 0.3863 0.2744 0.2045 0.1037 0.01772 0.01323 0.0001200
Cumulative Proportion 0.3863 0.6607 0.8652 0.9689 0.98665 0.99988 1.0000000
Scaling 2 for species and site scores
* Species are scaled proportional to eigenvalues
* Sites are unscaled: weighted dispersion equal on all dimensions
Species scores
CCA1 CCA2 CCA3 CCA4 CCA5 CCA6
S.marinoi -0.3890 0.39759 0.1080 -0.005704 -0.005372 -0.0002441
C.tripos 1.8428 0.23999 -0.1661 -1.337082 0.636225 -0.5204045
P.alata 1.6892 0.17910 -0.3119 0.997590 0.142028 0.0601177
P.seriata 1.4365 -0.15112 -0.8646 0.915351 -1.455675 -1.4054078
D.confervacea 0.2098 -1.23522 0.5317 -0.089496 -0.034250 0.0278820
C.decipiens 2.2896 0.65801 -1.0315 -1.246933 -0.428691 0.3649382
P.farcimen -1.2897 -1.19148 -2.3562 0.032558 0.104148 -0.0068910
C.furca 1.4439 -0.02836 -0.9459 0.301348 -0.975261 0.4861669
Biplot scores for constraining variables
CCA1 CCA2 CCA3 CCA4 CCA5 CCA6
Temperature 0.88651 0.1043 -0.07283 -0.30912 -0.22541 0.24771
Salinity 0.32228 -0.3490 0.30471 0.05140 -0.32600 0.44408
O2 -0.81650 0.4665 -0.07151 0.03457 0.20399 -0.20298
Phosphate 0.22667 -0.8415 0.41741 -0.17725 -0.06941 -0.06605
TotP -0.33506 -0.6371 0.38858 -0.05094 -0.24700 -0.25107
Nitrate 0.15520 -0.3674 0.38238 -0.07154 -0.41349 -0.56582
TotN -0.23253 -0.3958 0.16550 -0.25979 -0.39029 -0.68259
Silica 0.04449 -0.8382 0.15934 -0.22951 -0.35540 -0.25650
Which of all these numbers are important to my analysis?
/anna
How much variation is explained by X?
In a CCA, variance isn't variance in the normal sense. We express it as the "mean squared contingency coefficient", or "inertia". All the info you need to ascertain how much "variation" in Y is explained by X is contained in the section of the output that I reproduce below:
Partitioning of mean squared contingency coefficient:
Inertia Proportion
Total 4.151 1.0000
Constrained 1.705 0.4109
Unconstrained 2.445 0.5891
In this example there is total inertia 4.151 and your X variables (these are "Constraints") explain a total of 1.705 bits of inertia, which is about 41%, leaving about 59% unexplained.
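The same proportions can be pulled from the fitted object instead of being read off the printout. A minimal sketch, using the `varespec ~ Al + P + K` model from vegan's bundled data as a stand-in for the asker's model:

```r
library(vegan)
data(varespec, varechem)

m <- cca(varespec ~ Al + P + K, data = varechem)

total       <- m$tot.chi       # total inertia
constrained <- m$CCA$tot.chi   # inertia explained by the constraints
residual    <- m$CA$tot.chi    # unconstrained (unexplained) inertia

constrained / total            # proportion explained
RsquareAdj(m)                  # unadjusted and adjusted R-squared
```

For the output in the question, the corresponding calculation is 1.705 / 4.151, which is about 0.41.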
The next section, referring to eigenvalues, allows you to see, both in terms of inertia explained and proportion explained, which axes contribute most to the explanatory "power" of the CCA (the Constrained part of the table above) and to the unexplained "variance" (the Unconstrained part of the table above).
The next section contains the ordination scores. Think of these as the coordinates of the points in the triplot. For some reason you don't show the site scores in the output above, but they would normally be there. Note that these have been scaled - by default this is using scaling = 2 - so site points are at their weighted average of the species scores IIRC etc.
The "Biplot" scores are the locations of the arrow heads or the labels on the arrows - I forget exactly how the plot is drawn now.
Which of all these numbers are important to my analysis?
All of them are important - if you think the triplot is important and interpretable, it is based entirely on the information reported by summary(). If you have specific questions to ask of the data, then perhaps only certain sections will be of paramount importance to you.
However, StackOverflow is not the place to ask such questions of a statistical nature.
I don't have the ability to comment, but regarding the first answer's interpretation of the species and site scores in scaling 2, I believe the explanation is backwards.
In the book "Numerical Ecology with R" by Borcard, Gillet, and Legendre, the authors clearly state that in scaling 2 species scores are weighted averages of the sites.
This can be confirmed by using the ordihull function with a CCA.
Also, the output from the OP states that species scores are scaled and site scores are unscaled, which I believe confirms what the book says:
"Scaling 2 for species and site scores
* Species are scaled proportional to eigenvalues
* Sites are unscaled: weighted dispersion equal on all dimensions"
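A numerical check is also possible. The sketch below uses an unconstrained CA on vegan's `dune` data, a simplification assumed here to keep the check short (in a constrained CCA the site scores come in two flavours, WA and LC). It compares the scaling 2 species scores with abundance-weighted averages of the site scores.

```r
library(vegan)
data(dune)

m <- cca(dune)  # unconstrained CA keeps the check simple

sp <- scores(m, display = "species", choices = 1:2, scaling = "species")
st <- scores(m, display = "sites",   choices = 1:2, scaling = "species")

# Weighted average of site scores for each species, weights = abundances
Y  <- as.matrix(dune)
wa <- crossprod(Y, st) / colSums(Y)

max(abs(wa - sp))  # largest discrepancy between the two sets of scores
```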

fitting student distribution with scale parameter

I have the x and y values below. As you can see, x is mostly negative; basically I only have the left side of the PDF of my observed data.
I have to fit it with a Student t distribution and find the degrees of freedom and the scale parameter.
The problem is that the estimated distribution is going to have a very small variance (i.e. a small scale parameter). So when I use the method below to fit the distribution, nls fails to converge no matter what initial values I set.
I have used an extra parameter c in the code below because I scale the distribution using dt(x/a,df). Therefore, in order to conserve the probability, I unavoidably have to multiply the output by a constant. I believe this extra parameter leads to poor convergence, but I have no idea how to fit the distribution in a better way.
I have looked for distribution-fitting packages, but those packages require a complete distribution while I only have the left side of it.
x y
1 -0.0050 0.000000
2 -0.0045 26.723019
3 -0.0040 28.557704
4 -0.0035 41.085068
5 -0.0030 66.258445
6 -0.0025 81.129807
7 -0.0020 83.751611
8 -0.0015 130.378353
9 -0.0010 157.806018
10 -0.0005 201.505657
11 0.0000 949.650354
12 0.0005 193.721270
dat<-data.frame(x=x,y=y)
res<-nls( y~(dt(x/a,df)*c), dat,
start=list(a=0.000201, df=0.9, c=2104), trace = TRUE)
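One way to remove the extra constant, sketched here under the assumption that y is a properly normalized density: if T follows a Student t distribution, then a*T has density dt(x/a, df)/a, so the factor is fixed at 1/a rather than being a free parameter. The helper name `dscaled_t`, the use of `optim` in place of `nls`, and the starting values are illustrative choices, and a good fit is not guaranteed. If y is not normalized, a free constant is still needed.

```r
# Density of a * T, where T is Student t with df degrees of freedom:
# the Jacobian 1/a replaces the free constant c
dscaled_t <- function(x, a, df) dt(x / a, df) / a

# Sanity check with arbitrary values (a = 2, df = 3): total probability
total_prob <- integrate(function(x) dscaled_t(x, a = 2, df = 3),
                        -Inf, Inf)$value
total_prob

# Data from the question
x <- seq(-0.005, 0.0005, by = 0.0005)
y <- c(0, 26.723019, 28.557704, 41.085068, 66.258445, 81.129807,
       83.751611, 130.378353, 157.806018, 201.505657, 949.650354,
       193.721270)

# Least squares on log-parameters keeps a and df positive
sse <- function(p) sum((y - dscaled_t(x, exp(p[1]), exp(p[2])))^2)
fit <- optim(c(log(2e-4), log(1)), sse)
exp(fit$par)  # estimates of a and df
```

Rescaling x (for example, working in units of 0.001) is another common way to improve the conditioning of such fits.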
