How to specify subset/sample number for permutations using specaccum() in R's vegan package

I have a community matrix (species as columns, samples as rows) from which I would like to generate a species accumulation curve (SAC) using the specaccum() and fitspecaccum() functions in R's vegan package. In order for the resulting SAC and cumulative species richness at sample X to be comparable among regions (I have 1 community matrix per region), I need to have specaccum() choose the same number of sets within each region. My problem is that some regions have a larger number of sets than others. I would like to limit the sample size to the minimum number of sets among regions (in my case 45), so I would like specaccum() to randomly sample 45 sets, 100 times (permutations=100), for each region, sampling from the entire data set available for each region. The code below has not worked: it doesn't recognize "subset=45". The vegan documentation says "subset" needs to be logical. I don't understand how a subset number can be logical, but maybe I am misinterpreting what subset is. Is there another way to do this? Would it be sufficient to run specaccum() for the entire number of sets available for each region and then just truncate the output to 45?
require(vegan)
pool1 <- specaccum(comm.matrix, gamma = "jack1", method = "random", subset = 45, permutations = 100)
Any help is much appreciated.

Why do you want to limit the function to work in a random sample of 45 cases? Just use the species accumulation up to 45 cases. Taking a random subset of 45 cases gives you the same accumulation, except for the random error of subsampling and throwing away information. If you want to compare your different cases, just compare them at the sample size that suits all cases, that is, at 45 or less. That is the idea of species accumulation models.
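A minimal sketch of that comparison, assuming the BCI data shipped with vegan stands in for one region's community matrix and that 45 is the common sample size. The names n_common, rich_at_common and sd_at_common are made up for this illustration:

```r
library(vegan)

data(BCI)                      # 50 plots; stands in for one region's matrix
set.seed(1)                    # make the random permutations reproducible

# Accumulate over ALL plots of the region, using 100 random orderings
sac <- specaccum(BCI, method = "random", permutations = 100)

n_common <- 45                 # assumed smallest number of samples among regions

# Mean richness and its standard deviation at the common sample size
rich_at_common <- sac$richness[n_common]
sd_at_common   <- sac$sd[n_common]
```

Repeating this for each region's matrix and comparing the values at n_common would use all the available data without any subsampling.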
The subset is intended for situations where you have (possibly) heterogeneous collection of sampling units, and you want to stratify data. For instance, if you want to see only the species accumulation in the "OldLow" habitat type of the Barro Colorado data, you could do:
data(BCI, BCI.env)
plot(specaccum(BCI, subset = BCI.env$Habitat == "OldLow"))
If you want to have, say, a subset of 30 sample plots of the same data, you could do:
take <- c(rep(TRUE, 30), rep(FALSE, 20))
plot(specaccum(BCI)) # to see it all
# repeat the following to see how taking subset influences
lines(specaccum(BCI, subset = sample(take)), col = "blue")
If you repeat the last line, you see how taking random subset influences the results: the lines are normally within the error bars of all data, but differ from each other due to random error.

Related

How to include plots / rows with zero values in the presence / absence community matrix in a CCA using R Vegan package

I am trying to do CCA using a presence / absence matrix of plant quadrat data and continuous environmental data for the same quadrats, using the Vegan package in R. Some of the quadrats have no plant species present (the row for the quadrat is full of 0's) but do have corresponding environmental data in another dataframe. The context of the study is that the environmental data is metal concentrations in soil, which are typically high where there are no plant species, so the quadrats with zero species do contribute to the data, and are not errors or NA's. When running the CCA with the R Vegan Package so far I have had to delete these rows to get it to work, otherwise it returns the error
'Error in cca.default(d$X, d$Y, d$Z) :
all row sums must be >0 in the community data matrix' .
Is there a way to include the data from quadrats that have no plant species in the CCA? I have read this paper, which also uses the vegan package and has a similar research design: https://www.researchgate.net/publication/229087061_Relationships_between_the_presence_of_odonate_species_and_environmental_characteristics_in_lowland_ponds_of_central_Italy. The authors included plots with zero species by adding a 'zero species' variable, but they do not elaborate on how this is done.
I am new to coding so any help is very much appreciated,
Thanks in advance
Here is how to do it. Assume your data set is called comm and it has some rows (sampling units) that have no species:
comm$ZERO <- as.numeric(rowSums(comm) == 0)
This will add a new column ZERO which is 1 for rows that had no species, and 0 for others.
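As a small illustration, assuming a made-up presence/absence table with three hypothetical species (sp1 to sp3) in which the third quadrat is empty:

```r
# Toy presence/absence data; the species names and values are invented
comm <- data.frame(sp1 = c(1, 0, 0, 1),
                   sp2 = c(0, 1, 0, 1),
                   sp3 = c(1, 1, 0, 0))

# Flag the quadrats that contain no species at all
comm$ZERO <- as.numeric(rowSums(comm) == 0)

comm$ZERO               # 0 0 1 0
all(rowSums(comm) > 0)  # TRUE: every row sum is now positive
```

Note that the ZERO column has to be computed once, before it is part of comm, because afterwards no row sum is zero any more.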
Personally, I would be worried about doing this. Correspondence Analysis is a compositional analysis, and adding a column (species) that by definition never occurs with any other species creates a data set with two disjunct blocks. In unconstrained CA this disjunction shows up as a first eigenvalue of 1, which is the theoretical maximum in CA. The first eigenvector will separate the blocks: the ZERO species and the sampling units with ZERO species at one extreme, and all other species and sampling units at the other extreme of the first axis. The second axis of this ZERO ordination will be identical to the first axis without ZERO, so in effect you just add this disjunction axis to the ordination.
Things are slightly different with CCA which actually looks at the fitted values of your species, and these fitted values may not be disjunct. So technically you can do it. However, it is not quite clear to me what you do if you do so. Even if the data set is not completely disjunct with CCA, the zero sampling units will probably be far separated from other points, and all plotted in the same point.

How to use the "how" function for an unbalanced repeated design

I have a set of control and treated plots which have been sampled over several years. I ran the prc function in the vegan package and want to perform a permutation test to check whether control and treated plots differ significantly over the years. As my data is unbalanced, I cannot use strata. My code looks like this:
library(vegan)
year=as.factor(c(rep(1995,8),rep(1999,8),rep(2001,8),rep(2013,4),rep(1995,4),
rep(1999,4),rep(2001,4),rep(2013,4)))
treatment=as.factor(c(rep("control",28),rep("treated",16)))
I've written this, but I'm sure that it is wrong because the treatment is missing here:
h1 <- how(within = Within(type = "series", mirror = FALSE),
          blocks = year, nperm = 999)
Any suggestions are greatly appreciated.
Under the null hypothesis, samples from the control or treated groups are exchangeable and hence you don't want them in the permutation design; you really want to permute them to generate the permutation-based null distribution for the test statistic.
The permutation design is there to indicate what isn't exchangeable.
You haven't explained why you want samples within the blocks to be permuted in series; why are samples within years also time series? If they're not, you don't want this.
You only need to worry about imbalance if you want to permute the strata. Whilst using blocks is similar in some respects to strata, blocks are never permuted so if you can use blocks you can use strata as you won't be permuting them.
If you want to permute the years as groups of samples, then you'll need strata and you'll need balance at the year level, which you don't have.
What you have defined with your call to how() is:
groups samples by year and as such samples will never be swapped between years, and
samples within the levels of year will be permuted in series, keeping their temporal order intact after applying cyclic shift permutations.
If that's not what you want to do, you need to explain in words what you want to do. By "do" I mean what is it you want to test? What is your model in vegan?
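To see what such a design does in practice, here is a hedged sketch that draws a single permutation from it with shuffle() from the permute package (the package that provides how()). The year vector is copied from the question; the object name perm and the seed are arbitrary:

```r
library(permute)

year <- as.factor(c(rep(1995, 8), rep(1999, 8), rep(2001, 8), rep(2013, 4),
                    rep(1995, 4), rep(1999, 4), rep(2001, 4), rep(2013, 4)))

# Design from the question: years as blocks, cyclic shifts within each block
h1 <- how(within = Within(type = "series", mirror = FALSE),
          blocks = year, nperm = 999)

set.seed(1)
perm <- shuffle(length(year), control = h1)  # one permutation of the 44 rows

# Samples only ever move within their own year
all(year[perm] == year)
```

Printing perm next to the original row order shows which samples are exchanged, which makes it easier to judge whether the design matches the hypothesis being tested.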

Plotting different mixture model clusters in the same curve

I have two sets of data, one representing a healthy data set having 4 variables and 11,000 points and another representing a faulty set having 4 variables and 600 points. I have used R's package MClust to obtain GMM clustering for each data set separately. What I want to do is to obtain both clusters in the same frame so as to study them at the same time. How can that be done?
I have tried joining both the datasets but the result I am obtaining is not what I want.
The code in use is:
Dat4M <- Mclust(Dat3, G = 3)
Dat3 is where I am storing my dataset, Dat4M is where I store the result of Mclust. G = 3 is the number of Gaussian mixtures I want, which in this case is three. To plot the result, the following code is used:
plot(Dat4M)
The following is obtained when I apply the above code in my Healthy dataset:
The following is obtained when the above code is used on Faulty dataset:
Notice that in the density plot for the faulty data, for the combination of CCD and CCA, two density points have been obtained. I want to place these in the same panel as the healthy data and study the differences.
Any help on how to do this will be appreciated.

Apply k-means to examine differences between two groups in R

I have two groups. The treatment group is exposed to media; the control group is not. They are distinguished by a categorical variable in the data frame (exposure to media = 1, no media = 0).
Now, I want to examine whether there are any clear differences between these two groups. To do this, apply the k-means algorithm with two clusters to four variables (proportion of black population, proportion of male population, proportion of hispanic population, median income on the logarithmic scale).
How to do this in R? Could anyone give some hints? Thanks!
Try this:
km <- kmeans(your_data, centers = 2, nstart = 10)
Here your_data is a data.frame (your whole data set, or just the variables you are interested in). You need to choose the number of clusters (here 2). A good practice for understanding your data is to try different numbers of clusters and see which one fits your data better (using, for example, a criterion such as AIC or BIC).
k-means is an approach for clustering data: the observations are assumed to come from different distributions, and we would like to know which distribution each observation comes from.
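A minimal sketch for the two-group question, using simulated data because the real data frame is not shown. The column names (media, prop_black, prop_male, prop_hisp, log_income) are assumptions, and standardising with scale() is an extra step not mentioned above, added so that no single variable dominates the distances:

```r
set.seed(123)
n <- 200

# Simulated stand-in for the real data frame; all values are invented
dat <- data.frame(
  media      = rep(c(0, 1), each = n / 2),   # 1 = exposure to media, 0 = none
  prop_black = runif(n),
  prop_male  = runif(n, 0.4, 0.6),
  prop_hisp  = runif(n),
  log_income = rnorm(n, mean = 10.8, sd = 0.4)
)

vars <- c("prop_black", "prop_male", "prop_hisp", "log_income")

# k-means with two clusters on the standardised variables
km <- kmeans(scale(dat[, vars]), centers = 2, nstart = 10)

# Cross-tabulate cluster membership against the treatment indicator
tab <- table(media = dat$media, cluster = km$cluster)
tab
```

If the two groups really differ on these variables, most of each group should fall into one cluster; a table with roughly equal counts in every cell suggests no clear separation.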
You can also have a look at many tutorials about kmeans in R. For example,
https://onlinecourses.science.psu.edu/stat857/node/125
https://www.r-statistics.com/2013/08/k-means-clustering-from-r-in-action/
http://www.statmethods.net/advstats/cluster.html

cluster ordinal data

I want to cluster my data (kmeans or hclust) in R. My data is ordinal: it is a Likert scale measuring the causes of cost escalation (I have 41 causes, the "variables"), scaled from 1 to 5, where 1 is no effect and 5 is major effect (I have about 160 observations, the people who rank the causes). Any help on how to cluster the 41 causes based on the observations would be welcome. Do I have to convert the scale to percentages or z-scores before clustering, or is there anything else that would help? I really need your help! Here is the data to play with: https://docs.google.com/spreadsheet/ccc?key=0AlrR2eXjV8nXdGtLdlYzVk01cE96Rzg2NzRpbEZjUFE&usp=sharing
I want to cluster the variables (the columns) in terms of similarity of occurrence in observations. I followed the code at statmethods.net/advstats/cluster.html, but I couldn't cluster the variables (the columns) in terms of similarity of occurrence in observations. I also followed the work at mattpeeples.net/kmeans.html#help, but I don't know why he converts the data to percentages and then standardizes to z-scores.
It isn't clear to me if you want to cluster the rows (the observations) in terms of similarity in the variables, or cluster the variables (the columns) in terms of similarity of occurrence in observations?
Anyway, see package cluster. This is a recommended package that comes with all R installations.
Read ?daisy for details of what is done with ordinal data. This metric can be used in functions such as agnes (for hierarchical clustering) or pam (for partitioning about medoids, a more robust version of k-means).
By default, these will cluster the rows/observations. Simply transpose the data object using t() if you want to cluster the columns (variables). Although that may well mess up the data depending on how you have stored them.
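A hedged sketch of both directions with the cluster package, on simulated Likert scores because the linked spreadsheet is not reproduced here. Only 6 made-up causes are used instead of 41, the values of k are arbitrary, and using the Manhattan metric on the transposed numeric scores is just one possible choice for comparing variables:

```r
library(cluster)

set.seed(42)
n_obs  <- 160   # respondents
n_vars <- 6     # made-up number of causes, kept small for the example

# Simulated scores from 1 (no effect) to 5 (major effect)
scores <- matrix(sample(1:5, n_obs * n_vars, replace = TRUE),
                 nrow = n_obs,
                 dimnames = list(NULL, paste0("cause", seq_len(n_vars))))

# Rows/observations: ordered factors, so daisy() treats the columns as ordinal
likert   <- as.data.frame(scores)
likert[] <- lapply(likert, factor, levels = 1:5, ordered = TRUE)
d_rows   <- daisy(likert, metric = "gower")
pam_fit  <- pam(d_rows, k = 3)             # k = 3 is an arbitrary choice

# Columns/variables: transpose the numeric scores so each cause becomes a row
d_vars <- daisy(t(scores), metric = "manhattan")
hc     <- agnes(d_vars, method = "average")
groups <- cutree(as.hclust(hc), k = 2)     # k = 2 is an arbitrary choice
```

Transposing the numeric matrix rather than the data frame of factors avoids the problem mentioned above, since t() on a data frame of factors would return a character matrix.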
Converting the data to percentages is a form of normalization, so that all the variables are in the range 0 to 1.
If the data is not normalized, you run the risk of bias towards dimensions with large values.
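For illustration, a small sketch of the two rescalings on a made-up vector. Min-max scaling is used here as one way of mapping values to the 0 to 1 range, which is an assumption about what "percentage" means in the linked tutorial:

```r
x <- c(1, 3, 5, 2, 4)            # made-up scores

# Min-max normalization: smallest value becomes 0, largest becomes 1
x01 <- (x - min(x)) / (max(x) - min(x))
x01                              # 0.00 0.50 1.00 0.25 0.75

# Z-score standardization: mean 0, standard deviation 1
z <- as.numeric(scale(x))
```

For Likert items that all share the same 1 to 5 scale, the variables are already on a common range, so this matters mostly when variables with different units are mixed.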
