DCA in vegan: small DCA1 species scores resulting in a weird plot - r

Hello StackOverflow community,
Five weeks ago I learned to read and write R, and it made me a happier being :) Stack Overflow has helped me out a hundred times or more! For a while now I have been struggling with vegan. So far I have succeeded in making beautiful NMDS plots. The next step for me is DCA, but here I run into trouble...
Let me explain:
I have an abundance dataset where the columns are different species (N = 120) and the rows are transects (N = 460). Column 1 with the transect codes has been deleted. Abundances are raw counts (not relative or transformed). Most species are rare to very rare, and a couple of species have very high abundance (10,000-30,000 individuals). The total is about 100,000 individuals.
When I run the decorana function it returns this output:
decorana(veg = DCAMVA)

Detrended correspondence analysis with 26 segments.
Rescaling of axes with 4 iterations.

                  DCA1   DCA2   DCA3   DCA4
Eigenvalues     0.7121 0.4335 0.1657 0.2038
Decorana values 0.7509 0.4368 0.2202 0.1763
Axis lengths    1.7012 4.0098 2.5812 3.3408
The DCA1 species scores are, however, really small... Only one species has a DCA1 score of about 2; for the rest the scores are all on the order of -1.4e-4, etc. This high-DCA1 species has an abundance of 1 individual... but it is not the only species with just 1 individual:
                          DCA1      DCA2      DCA3      DCA4  Totals
almaco.jack           6.44e-04  1.85e-01  1.37e-01  3.95e-02       0
Atlantic.trumpetfish  4.21e-05  5.05e-01 -6.89e-02  9.12e-02     104
banded.butterflyfish -4.62e-07  6.84e-01 -4.04e-01 -2.68e-01      32
bar.jack             -3.41e-04  6.12e-01 -2.04e-01  5.53e-01      91
barred.cardinalfish  -3.69e-04  2.94e+00 -1.41e+00  2.30e+00      15
... and so on
I can't post the plot on Stack Overflow yet, but the idea is that there is spread along the y-axis while there is almost none along the x-axis, resulting in a vertical line in the plot.
Everything seems to run okay; no errors are returned. I just really wonder what the reason for this clustering is... Does anybody have a clue? Is there an ecological idea behind this?
Any help is appreciated :)
Love
Erik

Looks like your data has an "outlier", a deviant site with a deviant species composition. DCA has essentially selected the first axis to separate this site from everything else, and DCA2 then reflects a major pattern of variance in the remaining sites. (D)CA is known to suffer (if you want to call it that) from this problem, but it is really telling you something about your data. This likely didn't affect NMDS at all, because metaMDS() maps the rank order of the distances between samples; it therefore only needs to put this sample slightly further away from any other sample than the distance between the next two most dissimilar samples.
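A minimal sketch for spotting the deviant sample, reusing the DCAMVA object from your question (inspect the extreme site scores on DCA1):

ord <- decorana(DCAMVA)
site.sc <- scores(ord, display = "sites")
## The largest absolute DCA1 scores point at the outlying transect(s)
head(sort(abs(site.sc[, "DCA1"]), decreasing = TRUE))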
You could simply stop using (D)CA for these sorts of data and continue to use NMDS via metaMDS() in vegan. An alternative is to apply a transformation such as the Hellinger transformation and then use PCA (see Legendre & Gallagher 2001, Oecologia, for the details). This transformation can be applied via decostand(...., method = "hellinger"), but it is trivial to do by hand as well...
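For example, a minimal sketch of the Hellinger route, reusing DCAMVA from the question:

library(vegan)
DCAMVA.hel <- decostand(DCAMVA, method = "hellinger")
## The same transformation by hand: square roots of row-wise relative abundances
DCAMVA.hel2 <- sqrt(DCAMVA / rowSums(DCAMVA))
max(abs(DCAMVA.hel - DCAMVA.hel2))  # effectively zero
## Then ordinate with PCA (an unconstrained rda() in vegan)
pca <- rda(DCAMVA.hel)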

Related

How to include plots / rows with zero values in the presence / absence community matrix in a CCA using R Vegan package

I am trying to do CCA using a presence/absence matrix of plant quadrat data and continuous environmental data for the same quadrats, using the vegan package in R. Some of the quadrats have no plant species present (the row for the quadrat is full of 0's) but do have corresponding environmental data in another data frame. The context of the study is that the environmental data are metal concentrations in soil, which are typically high where there are no plant species, so the quadrats with zero species do contribute to the data and are not errors or NAs. When running the CCA with vegan I have so far had to delete these rows to get it to work; otherwise it returns the error
Error in cca.default(d$X, d$Y, d$Z) :
  all row sums must be >0 in the community data matrix
Is there a way to include the data from quadrats that have no plant species in the CCA? I have read this paper, which also uses the vegan package and has a similar research design: https://www.researchgate.net/publication/229087061_Relationships_between_the_presence_of_odonate_species_and_environmental_characteristics_in_lowland_ponds_of_central_Italy. The authors included plots with zero species by adding a 'zero species' variable, but they do not elaborate on how this is done.
I am new to coding so any help is very much appreciated,
Thanks in advance
Here is how to do it. Assume your data set is called comm and it has some rows (sampling units) that have no species:
comm$ZERO <- as.numeric(rowSums(comm) == 0)
This will add a new column ZERO which is 1 for rows that had no species, and 0 for others.
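With that column in place, the constrained analysis runs without the row-sum error. A hypothetical usage sketch (env, a data frame holding the soil metal concentrations, is an assumed name, not something from your question):

library(vegan)
comm$ZERO <- as.numeric(rowSums(comm) == 0)  # flag quadrats with no species
mod <- cca(comm ~ ., data = env)             # every row sum is now > 0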
Personally, I would be worried about doing this. Correspondence analysis is a compositional analysis, and adding a column (species) that by definition never occurs with any other species creates a data set with two disjunct blocks. In unconstrained CA this disjunct block manifests as a first eigenvalue of 1, which is the theoretical maximum in CA. This first eigenvector will separate the blocks: the ZERO species and the sampling units having it at one extreme, and all other species and sampling units at the other extreme of the first axis. The second axis of this ZERO ordination will be identical to the first axis of the ordination without ZERO, so in effect you just add this disjunction axis to the ordination.
Things are slightly different with CCA, which actually looks at the fitted values of your species, and these fitted values may not be disjunct. So technically you can do it. However, it is not quite clear to me what you achieve by doing so. Even if the data set is not completely disjunct with CCA, the zero sampling units will probably be far separated from the other points, and all plotted at the same point.

Testing species association and clustering: multivariate analyses / weighted association test

I am unsure as to which test to use in R. Here is the oversimplified sampling procedure, with easy-looking numbers:
We have 20 patches of the same size in a field.
Inside these patches, we look for 50 different species (10 species of grass, and 40 species of insect).
Every time we find a species of grass, we record its coverage on a rough logarithmic scale from 1 to 4.
Every time we find a species of insect, we count them and record their abundance on a rough logarithmic scale from 1 to 4.
So my data sort of looks like this: [data table image not reproduced]
My problem is: how do I test which species are significantly associated? How do I detect clusters? Multivariate analysis? Half-weight index? Bootstrap?
I'm not exactly gifted when it comes to statistics, so any help would be greatly appreciated!

Using envfit (vegan) to calculate species scores

I am running an NMDS and have a few questions regarding the envfit() function in the vegan package. I have read the documentation for this function and numerous posts on SO and others about vegan, envfit(), and species scores in general.
I have seen both envfit() and wascores() used to calculate species scores for ordination techniques. By default, metaMDS() uses wascores(). This uses weighted averaging, which I understand. I am having a harder time understanding envfit(). Do envfit() and wascores() yield the same results? Is wascores() preferable given that it is the default? I realize that in some situations wascores() might not be an option (i.e., negative values), as mentioned in this post: How to get 'species score' for ordination with metaMDS()?
Given that envfit() and wascores() both seem to be used for species scores, they should yield similar results, right? I am hoping that we could work through a proof of that here...
The following shows species scores determined by metaMDS() using the default wascores():
library(vegan)
data(varespec)
ord <- metaMDS(varespec)
## Species scores from metaMDS(), computed by weighted averaging (wascores)
species.scores <- as.data.frame(scores(ord, display = "species"))
species.scores
wascores() makes sense to me; it uses weighted averaging. There is a good explanation of weighted averaging for species scores in Analysis of Ecological Data by McCune and Grace (2002), p. 150.
Could somebody help me break down envfit()?
species.envfit <- envfit(ord, varespec, choices = c(1,2), permutations = 999)
species.scores.envfit <- as.data.frame(scores(species.envfit, display = "vectors"))
species.scores.envfit
"The values that you see in the table are the standardised coefficients from the linear regression used to project the vectors into the ordination. These are directions for arrows of unit length." - comment from Plotted envfit vectors not matching NMDS scores
Could somebody please show me what linear model is being run here and which standardized values are being extracted?
species.scores
species.scores.envfit
These values are very different from each other. What am I missing here?
This is my first SO post, so please have mercy. I would have asked a question on some of the other relevant threads, but I am the dregs of SO and don't even have the reputation to comment.
Thanks!
Q: Do wascores() and envfit() give the same result?
No, they do not give the same result, as they are doing two quite different things. In this answer I have explained how envfit() works. wascores() takes the coordinates of the points in the NMDS space and computes the mean on each dimension, weighting observations by the abundance of the species at each point. Hence the species score returned by wascores() is a weighted centroid in the NMDS space for each species, where the weights are the abundances of the species. envfit() fits vectors that point in the direction of increasing abundance. This implies a plane over the NMDS ordination where abundance increases linearly from any point on the plane as you move parallel to the arrow, whereas wascores() are best thought of as optima, where the abundance declines as you move away from the weighted centroid, although I think this analogy is looser than, say, with a CA ordination.
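To make the weighted-centroid idea concrete, here is a minimal sketch using the varespec example from your question (note that metaMDS() may additionally expand the stored species scores; see ?wascores):

library(vegan)
data(varespec)
ord <- metaMDS(varespec)
site.sc <- scores(ord, display = "sites")
## Weighted centroid of one species ("Cladstel") by hand:
w <- varespec[, "Cladstel"]
colSums(site.sc * w) / sum(w)
## This matches the corresponding row of wascores():
wascores(site.sc, varespec)["Cladstel", ]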
The issue about being optima or not is an issue if you passed in standardised data; as the answer you linked to shows, this would imply negative weights, which doesn't work. Typically one doesn't standardise species abundances. There are transformations that we apply, like converting to proportions, square-root or log transformations, or normalizing the data to the interval 0-1, but these wouldn't give you negative abundances, so you're less likely to run into that issue.
Fitting envfit() vectors in an NMDS is not necessarily a good thing, as we wouldn't expect abundances to vary linearly over the ordination space. The wascores() are better in that they imply non-linear abundances, but they are a little hackish in NMDS. ordisurf() is a better option in general: it adds a GAM (smooth) surface instead of the plane implied by the vectors, but you can't show more than one or a few surfaces on an ordination, whereas you can add as many species WA scores or arrows as you want.
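For a single species the contrast looks like this, reusing ord and varespec from the code in your question:

plot(ord, display = "sites")
plot(envfit(ord, varespec[, "Cladstel", drop = FALSE]))  # one straight arrow
ordisurf(ord, varespec[, "Cladstel"], add = TRUE)        # GAM surface on top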
The basic issue here is the assumption that envfit() and wascores() should give the same results. There is no reason to assume that, as these are fundamentally different approaches to computing "species scores" for NMDS, and each comes with its own assumptions, advantages, and disadvantages.
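As for the quoted claim about standardised coefficients, you can check it by hand. A sketch, again assuming ord and varespec from your code: regress one species on the NMDS axes and normalise the slope coefficients to unit length, which should reproduce the envfit() arrow for that species.

site.sc <- scores(ord, display = "sites")
fit <- lm(varespec[, "Cladstel"] ~ site.sc[, 1] + site.sc[, 2])
b <- coef(fit)[-1]    # the two slope coefficients, intercept dropped
b / sqrt(sum(b^2))    # unit-length direction of the arrow
## Compare with envfit()'s vector scores:
scores(envfit(ord, varespec[, "Cladstel", drop = FALSE]), display = "vectors")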

PCoA function pcoa(): extracting vectors; percentage of variance explained

I have a dataset consisting of 132 observations and 10 variables.
These variables are all categorical. I am trying to see how my observations cluster and how they differ based on the percentage of variance; i.e., I want to find out a) whether there are any variables that help to draw certain observation points apart from one another, and b) if yes, what percentage of variance is explained by them?
I was advised to run a PCoA (Principal Coordinates Analysis) on my data. I ran it using the vegan and ape packages. This is my code after loading my csv file into R; I call the data frame data:
library(vegan)  # vegdist()
library(ape)    # pcoa()

data.dis <- vegdist(data, method = "gower", na.rm = TRUE)
data.pcoa <- pcoa(data.dis)
I was then told to extract the vectors from the pcoa object, so:
data.pcoa$vectors
It then returned 132 rows but 20 columns of values (from Axis 1 to Axis 20).
I was perplexed over why there were 20 columns of values when I only have 10 variables; I was under the impression that I would only get 10 columns. If any kind souls out there could help explain: a) what do the vectors actually represent, and b) how do I get the percentage of variance explained by Axis 1 and 2?
Another question I had: I don't really understand the purpose of extracting the eigenvalues from data.pcoa. I saw some websites doing that after running a pcoa on their distance matrix, but there was no further explanation of it.
The Gower index is non-Euclidean, and so you can expect more real axes than the number of variables when it is subjected to a Euclidean ordination (PCoA). However, you said that your variables are categorical. I assume that in R lingo they are factors. If so, you should not use vegan::vegdist(), which only accepts numeric data. Moreover, if a variable is defined as a factor, vegan::vegdist() refuses to compute the dissimilarities and gives an error. If you managed to use vegdist(), you did not properly define your variables as factors. If you really have factor variables, you should use some other package than vegan for the Gower dissimilarity (there are many alternatives).
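One such alternative, as an illustration rather than a recommendation over the others: cluster::daisy() computes the Gower dissimilarity and accepts factor columns directly, and its result can be fed to ape::pcoa():

library(cluster)  # daisy()
library(ape)      # pcoa()
data.dis <- daisy(data, metric = "gower")  # handles factor columns
data.pcoa <- pcoa(data.dis)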
The percentage of "variance" is a bit tricky for non-Euclidean dissimilarities, which also give some negative eigenvalues corresponding to imaginary dimensions. In that case, the sum of all positive eigenvalues (real axes) is higher than the total "variance" of the data. ape::pcoa() returns the information you asked for in the element values. The proportion of variance explained is in its element values$Relative_eig. The total "variance" is returned in the element trace. All this was documented in ?pcoa, where I read it.
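In code, using the object names from your question:

head(data.pcoa$values$Relative_eig)      # proportion of "variance" per axis
sum(data.pcoa$values$Relative_eig[1:2])  # share explained by Axis 1 and 2
data.pcoa$trace                          # total "variance"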

Cross-correlation of 5 time series (distance) and interpretation

I would appreciate some input on this a lot!
I have data for 5 time series (an example of one step in the series is in the plot below), where each step in the series is a vertical profile of species sightings in the ocean, and the profiles were investigated 6 h apart. All 5 steps are spaced vertically by 0.1 m (and by the 6 h in time).
What I want to do is calculate the multivariate cross-correlation between all series in order to find out at which lag the profiles are most correlated and stable over time.
Profile example: [plot not reproduced]
I find the R documentation on this not so great, so what I did so far is use the MTS package with the ccm function to create cross-correlation matrices. However, the interpretation of the figures is rather difficult with the sparse documentation, and I would appreciate some help with that.
Data example:
http://pastebin.com/embed_iframe.php?i=8gdAeGP4
Save it in a file cross_correlation_stack.csv, or change the name as you wish.
library(dplyr)       # loaded in the original code, but not used below
library(MTS)         # ccm(), mq()
library(data.table)  # loaded in the original code, but not used below

d1 <- file.path('cross_correlation_stack.csv')
d2 <- read.csv(d1)

## Using package MTS: cross-correlation matrices
mod1 <- ccm(d2, lag = 1000, level = TRUE)

## Using base R
acf(d2, lag.max = 1000)

## MQ plot, also from the MTS package
mq(d2, lag = 1000)
These commands produce several figures, not reproduced here: the ccm() plots, the mq() plot, and in parallel the acf() plot matrix.
My question now is whether somebody can tell me if I am going in the right direction, or whether there are better-suited packages and commands.
Since the default figures don't get any titles etc., what am I looking at, specifically in the ccm figures?
The acf command was proposed somewhere, but can I use it here? Its documentation says it ... calculates autocovariance or autocorrelation..., and I assume this is not what I want. But then again, it's the only command that seems to work on multivariate data. I am confused.
The plot with the significance values shows that after a lag of 150 (15 meters) the p-values increase. How would you interpret that with regard to my data: 0.1 m intervals of species sightings, with many lags up to 100-150 significant? Would that mean something like: peaks in sightings are stable over the 5 time steps at a scale of 150 lags, i.e. 15 meters?
Either way, it would be nice if somebody who has worked with this before could explain what I am looking at! Any input is highly appreciated!
You can use the base R function ccf(), which will estimate the cross-correlation function between any two variables x and y. However, it only works on vectors, so you'll have to loop over the columns of your data frame (d2 in your code). Something like:
n <- ncol(d2)                       # number of series
cc <- vector("list", choose(n, 2))  # one slot per pair of columns
par(mfrow = c(ceiling(choose(n, 2) / 2), 2))
cnt <- 1
for (i in 1:(n - 1)) {
  for (j in (i + 1):n) {
    cc[[cnt]] <- ccf(d2[, i], d2[, j],
                     main = paste0("Cross-correlation of ", colnames(d2)[i],
                                   " with ", colnames(d2)[j]))
    cnt <- cnt + 1
  }
}
This will plot each of the estimated CCFs and store the estimates in the list cc. It is important to remember that the lag-k value returned by ccf(x, y) is an estimate of the correlation between x[t+k] and y[t].
All of that said, however, the CCF is only defined for data that are more or less normally distributed, and your data are clearly overdispersed with all of those zeroes. Therefore, lacking some adequate transformation, you should really look into other metrics of "association", such as the mutual information as estimated from entropy. I suggest checking out the R packages entropy and infotheo.
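A minimal sketch of that route with infotheo, assuming d2 from above (the default binning is arbitrary here):

library(infotheo)
## Mutual information needs discrete data, so bin the continuous profiles first
d.disc <- discretize(d2[, 1:2])
mutinformation(d.disc[, 1], d.disc[, 2])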
