How to include plots / rows with zero values in the presence / absence community matrix in a CCA using R Vegan package - r

I am trying to do CCA using a presence / absence matrix of plant quadrat data and continuous environmental data for the same quadrats, using the Vegan package in R. Some of the quadrats have no plant species present (the row for the quadrat is full of 0's) but do have corresponding environmental data in another dataframe. The context of the study is that the environmental data is metal concentrations in soil, which are typically high where there are no plant species, so the quadrats with zero species do contribute to the data, and are not errors or NA's. When running the CCA with the R Vegan Package so far I have had to delete these rows to get it to work, otherwise it returns the error
'Error in cca.default(d$X, d$Y, d$Z) :
all row sums must be >0 in the community data matrix' .
Is there a way to include the data from quadrats that have no plant species in the CCA? I have read in this paper, which also uses the Vegan package,: https://www.researchgate.net/publication/229087061_Relationships_between_the_presence_of_odonate_species_and_environmental_characteristics_in_lowland_ponds_of_central_Italy and that has a similar research design, that they have included plots with zero species by adding a 'zero species' variable but do not elaborate on how this is done.
I am new to coding so any help is very much appreciated,
Thanks in advance

Here is how to do it. Assume your data set is called comm and it has some rows (sampling units) that have no species:
comm$ZERO <- as.numeric(rowSums(comm) == 0)
This will add a new column ZERO which is 1 for rows that had no species, and 0 for others.
Personally, I would be worried about doing this. Correspondence Analysis is a compositional analysis, and adding a column (species) that never occurs with any other species (by definition) creates a data set with two disjunct blocks. In unconstrained CA this disjunct block manifests in first eigenvalue 1 – which is the theoretical maximum in CA. This first eigenvector will separate the blocks: ZERO species and the sampling units with ZERO species in one extreme, and all other species and sampling units in another extreme of the first axis. The second axis of this ZERO ordination will be identical to the first axis without ZERO, so in effect you just add this disjunction axis to the ordination.
Things are slightly different with CCA which actually looks at the fitted values of your species, and these fitted values may not be disjunct. So technically you can do it. However, it is not quite clear to me what you do if you do so. Even if the data set is not completely disjunct with CCA, the zero sampling units will probably be far separated from other points, and all plotted in the same point.

Related

When setting your obsCovs for the function pcount (package unmarked) how does R "know" which obsCov observation corresponds to each y value?

I'm relatively new at R particularly with this package. I am running n-mixture models assessing detection probabilities and abundance. I have abundance data, site covariates and observation covariates. There are three repeated observations(rounds)/site. The observation covariates are set up as columns (three column/covariate, one for each round). The rows are individual sites. The abundance data is formatted similarly, with each column heading representing a different round. I've copied my code below.
y.abun2<-COYE[2:4]
obsCovs.ss <- list(temp=Covariate2021[3:5], Date=Covariate2021[13:15], Cloud=Covariate2021[17:19], Wind=Covariate2021[21:23],Observ=Covariate2021[25:27])
siteCovs.ss <- Covariate2021[c(29,30,31,32)]
coyeabund<-unmarkedFramePCount(y=y.abun2, siteCovs = siteCovs.ss,
obsCovs = obsCovs.ss)
After this I scale using this code:
coyeabund#siteCovs$TreeCover <-
scale(coyeabund#siteCovs$TreeCover)
Moving on to my model I use this code:
abun.coye.full<-pcount(~TreeCover+temp+Date+Cloud+Wind+Observ ~ HHSDI+ProportionNH+Quality, coyeabund,mixture="NB", K=132,se=TRUE)
Is the model matching the observation covariates to the abundance measurements to each round? (i.e., is it able to tell that temp column 5 corresponds to the third round of abundance measurements?)
The models seem fine so far but I am so new at this I want to confirm that I haven't gone astray.

Display the name of corresponding PC when using prcomp for PCA in r

I use prcomp to run PCA in r. When I output the summary, i.e. standard deviation, proportion of variance, cumulative proportion, the results are always ordered and the actual column name is replaced by PC1, PC2. Thus, I cannot tell the exact proportion of variance for each column.
Can anyone show me or give me some hint on how to display the column when outputting summary results. Two results pics are attached here:
It is not clear that you understand what principal components does. It reduces the dimensionality of the data. Assuming the rows are observations and the columns are variables, imagine plotting your rows in 35 dimensions (the columns). Most people have trouble visualizing more than 3 dimensions. Principal components creates a smaller set of axes that explains most the the variation in the data. The axes are Euclidian meaning they are at right angles to one another. Your plot and the result of the summary(res.pca5) and plot(res.pca5) functions show that the first dimension explains 28% of the variation in the 35 variables. Adding a second dimension gives you almost 38% and three gives you 44%. These new variables are combinations of your original variables, not the original variables. The first two components explain more of the variability than any other combination.
For some reason you did not try res.pca5 as a command (or the equivalent print(res.pca5)) which would show you the coefficients that pca used to create the components from the original variables or biplot(res.pca5) which plots the rows and columns in the new two dimensional space.

PCoA function pcoa extract vectors; percentage of variance explained

I have a dataset consisting of 132 observations and 10 variables.
These variables are all categorical. I am trying to see how my observations cluster and how they are different based on the percentage of variance. i.e I want to find out if a) there are any variables which helps to draw certain observation points apart from one another and b) if yes, what is the percentage of variance explained by it?
I was advised to run a PCoA (Principle Coordinates Analysis) on my data. I ran it using vegan and ape package. This is my code after loading my csv file into r, I call it data
#data.dis<-vegdist(data,method="gower",na.rm=TRUE)
#data.pcoa<-pcoa(data.dis)
I was then told to extract the vectors from the pcoa data and so
#data.pcoa$vectors
It then returned me 132 rows but 20 columns of values (e.g. from Axis 1 to Axis 20)
I was perplexed over why there were 20 columns of values when I only have 10 variables. I was under the impression that I would only get 10 columns. If any kind souls out there could help to explain a) what do the vectors actually represent and b) how do I get the percentage of variance explained by Axis 1 and 2?
Another question that I had was I don't really understand the purpose of extracting the eigenvalues from data.pcoa because I saw some websites doing that after running a pcoa on their distance matrix but there was no further explanation on it.
Gower index is non-Euclidean and you can expect more real axes than the number of variables in Euclidean ordination (PCoA). However, you said that your variables are categorical. I assume that in R lingo they are factors. If so, you should not use vegan::vegdist() which only accepts numeric data. Moreover, if the variable is defined as a factor, vegan::vegdist() refuses to compute the dissimilarities and gives an error. If you managed to use vegdist(), you did not properly define your variables as factors. If you really have factor variables, you should use some other package than vegan for Gower dissimilarity (there are many alternatives).
Te percentage of "variance" is a bit tricky for non-Euclidean dissimilarities which also give some negative eigenvalues corresponding to imaginary dimensions. In that case, the sum of all positive eigenvalues (real axes) is higher than the total "variance" of data. ape::pcoa() returns the information you asked in the element values. The proportion of variances explained is in its element values$Relative_eig. The total "variance" is returned in element trace. All this was documented in ?pcoa where I read it.

How to specify subset/ sample number for permutations using specaccum() in R's vegan package

I have a community matrix (species as columns, samples as rows) from which I would like to generate a species accumulation curve (SAC) using the specaccum() and fitspecaccum() functions in R's vegan package. In order for the resulting SAC and cumulative species richness at sample X to be comparable among regions (I have 1 community matrix per region), I need to have specaccum() choose the same number of sets within each region. My problem is that some regions have a larger number of sets than others. I would like to limit the sample size to the minimum number of sets among regions (in my case, the minimum number of sets is 45, so I would like specaccum() to randomly sample 45 sets, 100 times (set permutations=100) for each region. I would like to sample from the entire data set available for each region. The code below has not worked... it doesn't recognize "subset=45". The vegan package info says "subset" needs to be logical... I don't understand how subset number can be logical, but maybe I am misinterpreting what subset is... Is there another way to do this? Would it be sufficient to run specaccum() for the entire number of sets available for each region and then just truncate the output to 45?
require(vegan)
pool1<-specaccum(comm.matrix, gamma="jack1", method="random", subet=45, permutations=100)
Any help is much appreciated.
Why do you want to limit the function to work in a random sample of 45 cases? Just use the species accumulation up to 45 cases. Taking a random subset of 45 cases gives you the same accumulation, except for the random error of subsampling and throwing away information. If you want to compare your different cases, just compare them at the sample size that suits all cases, that is, at 45 or less. That is the idea of species accumulation models.
The subset is intended for situations where you have (possibly) heterogeneous collection of sampling units, and you want to stratify data. For instance, if you want to see only the species accumulation in the "OldLow" habitat type of the Barro Colorado data, you could do:
data(BCI, BCI.env)
plot(specaccum(BCI, subset = BCI.env$Habitat == "OldLow"))
If you want to have, say, a subset of 30 sample plots of the same data, you could do:
take <- c(rep(TRUE, 30), rep(FALSE, 20))
plot(specaccum(BCI)) # to see it all
# repeat the following to see how taking subset influences
lines(specaccum(BCI, subset = sample(take)), col = "blue")
If you repeat the last line, you see how taking random subset influences the results: the lines are normally within the error bars of all data, but differ from each other due to random error.

cluster ordinal data

I want to do clustering of my data (kmeans or hclust) in R language (coding). My data is ordinal, which means that the data is Likert scale to measure the causes of cost escalation (I have 41 causes "variables") that scaled from 1 to 5, which 1 is no effect to 5 major effect (I have about 160 observations "who rank the causes")... any help of how to cluster the 41 cause based on the observations ... do I have to convert the scale to percentage or z score before clustering or any thing that help ...... I really need your help!! here is the data to play with https://docs.google.com/spreadsheet/ccc?key=0AlrR2eXjV8nXdGtLdlYzVk01cE96Rzg2NzRpbEZjUFE&usp=sharing
I want to cluster the variables (the columns) in terms of similarity of occurrence in observations... I follow the code in statmethods.net/advstats/cluster.html; but I couldn't cluster the variables (the columns) in terms of similarity of occurrence in observations and also I follow the work at mattpeeples.net/kmeans.html#help; but I don't know why he convert the data to percentage and then to Z-score standardize.
It isn't clear to me if you want to cluster the rows (the observations) in terms of similarity in the variables, or cluster the variables (the columns) in terms of similarity of occurrence in observations?
Anyway, see package cluster. This is a recommended package that comes with all R installations.
Read ?daisy for details of what is done with ordinal data. This metric can be used in functions such as agnes (for hierarchical clustering) or pam (for partitioning about medoids, a more robust version of k-means).
By default, these will cluster the rows/observations. Simply transpose the data object using t() if you want to cluster the columns (variables). Although that may well mess up the data depending on how you have stored them.
Converting the data to percentage is called normalization of data so all the variables are in the range of 0 - 1.
If data is not normalized you run the risk of bias towards dimensions with large values

Resources