Creating a dendrogram from the results of the multipatt function in the indicspecies package - r

I am getting familiar with the multipatt function in the indicspecies package. Thus far I have only seen summary() used to give a breakdown of the results. However, I would like a dendrogram, ideally with the names of the species that are most 'indicative' of a given community location.
Example from the package documentation:
library(indicspecies)
library(stats)
data(wetland) ## Loads species data
wetkm = kmeans(wetland, centers=3) ## Creates three clusters using kmeans
## Runs the combination analysis using IndVal.g as statistic
wetpt = multipatt(wetland, wetkm$cluster, control = how(nperm=999))
## Lists those species with significant association to one combination
summary(wetpt)
wetpt gives the raw results but I am not sure how to proceed to get a cluster plot out of this result. Can anyone offer any pointers?
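multipatt itself does not build a tree, so one option (a sketch, not something from the package documentation) is to run hclust on the same data and flag the species that multipatt reports as significant. The significance table lives in wetpt$sign, which has a p.value column; the 0.05 cutoff below is illustrative:

```r
library(indicspecies)
library(stats)
data(wetland)
wetkm <- kmeans(wetland, centers = 3)
wetpt <- multipatt(wetland, wetkm$cluster, control = how(nperm = 999))

# Species with a significant association to some cluster combination
sig <- rownames(wetpt$sign)[which(wetpt$sign$p.value <= 0.05)]

# Cluster the species (columns) by their distribution across sites,
# marking significant indicator species in the labels before clustering
wet2 <- wetland
colnames(wet2) <- ifelse(colnames(wet2) %in% sig,
                         paste0("* ", colnames(wet2)), colnames(wet2))
sp.hc <- hclust(dist(t(wet2)), method = "average")
plot(sp.hc, main = "Species dendrogram (* = significant indicator)")
```

The choice of distance and linkage here is arbitrary; a Bray-Curtis distance from vegan would be a more conventional choice for abundance data.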

Species scores not available as result of metaMDS

I am conducting an ordination analysis in R and I'm having trouble dealing with the results of the function metaMDS, from vegan. More specifically, I'm not able to get species scores from the metaMDS result. I found a similar question here How to get 'species score' for ordination with metaMDS()?, but the answer for it did not work for me.
My data is available here: https://drive.google.com/file/d/1btKzAWL_fmJ80GjcgMnwX5m_ls8h8vpY/view?usp=sharing
Here is the code I wrote so far:
library(vegan)
mydata <- read.table("nmdsdata.txt", header = TRUE, row.names = 1)
dist.f <- vegdist(mydata, method = "bray")
dist.f2 <- stepacross(dist.f, path = "extended")
results <- metaMDS(dist.f2, trymax = 500)
I was told to use the function stepacross since the species composition of my sites is quite different and I was not getting converging results in metaMDS without using it. So far so good.
My problem arises when I try to plot the results:
plot(results, type="t")
When I run that line I receive the following message on my console: "species scores not available". I tried to follow the approach recommended on the link I cited earlier, by running the code:
NMDS<-metaMDS(dist.f2,k=2,trymax=500, distance = "bray")
envfit(NMDS, dist.f2)
However, it does not seem to work here. It returns the site scores rather than the species scores, unlike when I use the data from the post linked above. The only difference in my code is that I use "bray" rather than "euclidean" distance. In addition, I get a warning after running envfit: "Warning message: In complete.cases(X) & complete.cases(env): longer object length is not a multiple of shorter object length".
I need both the site scores and the species scores to plot the results. Am I missing something here? Please bear in mind that I'm new to R.
Any help would be appreciated. Thanks in advance.
If you want to have species scores, you must supply information on species. Dissimilarities do not have any information on species, and therefore they won't work. You have two obvious (and documented) alternatives:
Supply a data matrix to metaMDS and it will calculate the dissimilarities, and if requested, also perform the stepacross:
## bray dissimilarities is the default, noshare triggers stepacross
NMDS <- metaMDS(mydata, noshare = TRUE, autotransform = FALSE, trymax = 500)
This also turns off automatic data transformation as you did not use it in your dissimilarities either.
Add species scores to the result after the analysis when you did not supply information on species:
NMDS <- metaMDS(dist.f2, trymax = 500)
sppscores(NMDS) <- mydata
If you transformed community data, you should use similar transformation for mydata in sppscores.
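As a self-contained sketch of the second alternative, using vegan's built-in dune data in place of the asker's file (which is not reproducible here); sppscores requires a reasonably recent version of vegan:

```r
library(vegan)
data(dune)                             # stand-in for the asker's mydata
dist.f <- vegdist(dune, method = "bray")
NMDS <- metaMDS(dist.f, trymax = 500)  # fitted on dissimilarities only
sppscores(NMDS) <- dune                # add species scores afterwards
plot(NMDS, type = "t")                 # now both site and species scores plot
```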

Efficient plotting of part of a hierarchical cluster

I am running agglomerative clustering on a data set of 130K rows (130K unique keys) and 7 columns, each column ranging from 20 to 2000 unique levels. The data are categorical, specifically alphanumeric codes. At most they can be thought of as factors. I am experimenting with what results I might get from a couple of alternatives to k-modes, including hierarchical clustering and MCA.
My question is, is there any good way to visualize the results up to a certain level with the tree structure?
Standard steps are not a problem:
library(cluster)
Compute Gower distance,
ptm <- proc.time()
gower.dist <- daisy(df[,colnams], metric = c("gower"))
elapsed <- proc.time() - ptm
c(elapsed[3],elapsed[3]/60)
Compute agglomerative clustering object from Gower distance
aggl.clust.c <- hclust(gower.dist, method = "complete")
Now to plotting it. The following line works, but with this many leaves the plot is unreadable
plot(aggl.clust.c, main = "Agglomerative, complete linkages")
Ideally what I am looking for would be something like so (the below is pseudocode that failed on my system)
plot(cutree(aggl.clust.c, k=7), main = "Agglomerative, complete linkages")
I am running R version 3.2.3. That version cannot change (and I don't believe it ought to make a difference for what I am trying to do).
I'd be interested in doing the same in Python, if anyone has good pointers.
I found a useful answer to my question re plotting part of a tree using the as.dendrogram() method. Link: http://www.sthda.com/english/wiki/beautiful-dendrogram-visualizations-in-r-5-must-known-methods-unsupervised-machine-learning
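For reference, a minimal base-R sketch of that approach: cut the dendrogram at a height and plot only the part above it. The data frame and cut height below are stand-ins, since the asker's 130K-row data set is not reproducible here:

```r
library(cluster)
library(stats)
# Small stand-in for the asker's categorical data
df <- data.frame(a = factor(rep(letters[1:4], 25)),
                 b = factor(rep(LETTERS[1:5], 20)))
gower.dist <- daisy(df, metric = "gower")
aggl.clust.c <- hclust(gower.dist, method = "complete")
dend <- as.dendrogram(aggl.clust.c)
# Keep only the top of the tree: branches below height h collapse away
upper <- cut(dend, h = 0.4)$upper
plot(upper, main = "Top of the dendrogram (h = 0.4)")
```

On 130K rows you would cut high up the tree so that only a handful of branches remain visible.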

Fuzzy C-Means Clustering in R

I am performing Fuzzy Clustering on some data. I first scaled the data frame so each variable has a mean of 0 and sd of 1. Then I ran the clValid function from the package clValid as follows:
library(plyr)      # ddply is used below
library(clValid)
df <- iris[, -5]   # I do not use iris, but to make this reproducible
clust <- sapply(df, scale)
intvalid <- clValid(clust, 2:10, clMethods = c("fanny"),
                    validation = "internal", maxitems = 1000)
The results told me 4 would be the best number of clusters. Therefore I ran the fanny function from the cluster package as follows:
res.fanny <- fanny(clust, 4, metric='SqEuclidean')
res.fanny$coeff
res.fanny$k.crisp
df$fuzzy<-res.fanny$clustering
profile <- ddply(df, .(fuzzy), summarize,
                 count = length(fuzzy))
However, looking at the profile, I only have 3 clusters instead of 4. How is this possible? Should I go with 3 clusters instead of 4? How do I explain this? I cannot recreate my data here because it is quite large. Has anybody else encountered this before?
This is an attempt at an answer based on limited information, and it may not fully address the questioner's situation; it sounds like there may be other issues. In chat they indicated that they had encountered additional errors that I cannot reproduce. fanny() will calculate and assign items to "crisp" clusters based on a metric. It will also produce a matrix showing the fuzzy cluster assignments, which may be accessed using membership.
The issue the questioner described can be recreated by increasing the memb.exp parameter using the iris data set. Here is an example:
library(plyr)
library(clValid)
library(cluster)
df<-iris[,-5] # I do not use iris, but to make reproducible
clust<-sapply(df,scale)
res.fanny <- fanny(clust, 4, metric='SqEuclidean', memb.exp = 2)
Calling res.fanny$k.crisp shows that this produces 4 crisp clusters.
res.fanny14 <- fanny(clust, 4, metric='SqEuclidean', memb.exp = 14)
Calling res.fanny14$k.crisp shows that this produces 3 crisp clusters.
One can still access the membership of each of the 4 clusters using res.fanny14$membership.
If you have a good reason to think there should be 4 crisp clusters, you could reduce the memb.exp parameter, which would tighten up the cluster assignments. Or, if you are doing some sort of supervised learning, one procedure to tune this parameter would be to reserve some test data, do a hyperparameter grid search, and then select the value that produces the best result on your preferred metric. However, without knowing more about the task, the data, or what the questioner is trying to accomplish, it is hard to suggest much more than this.
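The effect described above can be checked with a small loop (a sketch; the memb.exp values are illustrative):

```r
library(cluster)
clust <- sapply(iris[, -5], scale)
# Count crisp clusters for a range of membership exponents
for (e in c(2, 6, 10, 14)) {
  fit <- fanny(clust, 4, metric = "SqEuclidean", memb.exp = e)
  cat("memb.exp =", e, "-> k.crisp =", fit$k.crisp, "\n")
}
```

As memb.exp grows, memberships get fuzzier and fewer crisp clusters survive.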
First of all I encourage to read the nice vignette of the clValid package.
The R package clValid contains functions for validating the results of a cluster analysis. There are three main types of cluster validation measures available. One of these measures is the Dunn index, the ratio of the smallest distance between observations in different clusters to the largest intra-cluster distance. I focus on the Dunn index for simplicity. In general, connectivity should be minimized, while both the Dunn index and the silhouette width should be maximized.
clValid creators explicitly refer to the fanny function of the cluster package in their documentation.
The clValid package is useful for running several algorithms/metrics across a prespecified range of cluster numbers.
library(plyr)      # ddply is used below
library(clValid)
df <- iris[, -5]
table(iris$Species)
clust <- sapply(df, scale)
In my code I need to increase the iterations to reach convergence (maxit = 1500).
Results are obtained with the summary function applied to the clValid object intvalid.
It seems that the optimal number of clusters is 2 (but that is not the main point here).
intvalid <- clValid(clust, 2:5, clMethods = c("fanny"),
                    maxit = 1500,
                    validation = "internal",
                    metric = "euclidean")
summary(intvalid)
The results for any method can be extracted from a clValid object for further analysis using the clusters method. Here the results for the 2-cluster solution are extracted (hc$`2`), with emphasis on Dunn's partition coefficient (hc$`2`$coeff). Of course, these results relate to the "euclidean" metric of the clValid call.
hc <- clusters(intvalid, "fanny")
hc$`2`$coeff
Now I simply call fanny from the cluster package using the euclidean metric and 2 clusters. The results overlap completely with the previous step.
res.fanny <- fanny(clust, 2, metric='euclidean', maxit = 1500)
res.fanny$coeff
Now, we can look at the classification table
table(hc$`2`$clustering, iris[,5])
  setosa versicolor virginica
1     50          0         0
2      0         50        50
and to the profile
df$fuzzy <- hc$`2`$clustering
profile <- ddply(df, .(fuzzy), summarize,
                 count = length(fuzzy))
profile
  fuzzy count
1     1    50
2     2   100

ANOSIM with cutree groupings

What I would like to do is an ANOSIM on defined groupings in some assemblage data, to see whether the groupings are significantly different from one another, in a similar fashion to this example code:
library(vegan)
data(dune)
data(dune.env)
dune.dist <- vegdist(dune)
attach(dune.env)
dune.ano <- anosim(dune.dist, Management)
summary(dune.ano)
However, in my own data I have the species abundances in a Bray-Curtis dissimilarity matrix. After creating hclust() diagrams, I define my own groupings visually by looking at the dendrogram and setting a height. Through cutree() I can then get these groupings, which can be superimposed on MDS plots etc. But I would like to check the significance of the similarity between the groupings I have created - i.e. are the groupings significantly different, or just arbitrary?
e.g.
data("dune")
dune.dist <- vegdist(dune)
clua <- hclust(dune.dist, "average")
plot(clua)
rect.hclust(clua, h =0.65)
c1 <- cutree(clua, h=0.65)
I then want to use the c1-defined category as the groupings (which in the example code was the Management factor) and test their similarities via anosim() to see whether they are actually different.
I am pretty sure this is just a matter of my inept coding... any advice would be appreciated.
cutree returns groups as integers: you must change these to factors if you want to use them in anosim. Try anosim(vegdist(dune), factor(c1)). You would also do well to consult a local statistician before using anosim to analyse dissimilarities with clusters created from those very same dissimilarities.
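Putting the pieces together (a sketch; the 0.65 cut height is the asker's own choice):

```r
library(vegan)
data(dune)
dune.dist <- vegdist(dune)
clua <- hclust(dune.dist, "average")
c1 <- cutree(clua, h = 0.65)
# anosim needs a grouping factor, not the integer codes cutree returns
dune.ano <- anosim(dune.dist, factor(c1))
summary(dune.ano)
```

Keep the caveat above in mind: groups derived from the same dissimilarities being tested will tend to look artificially distinct.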

Using survfit object's formula in survdiff call

I'm doing some survival analysis in R, and looking to tidy up/simplify my code.
At the moment I'm doing several steps in my data analysis:
make a Surv object (time variable with indication as to whether each observation was censored);
fit this Surv object according to a categorical predictor, for plotting/estimation of median survival time processes; and
calculate a log-rank test to ask whether there is evidence of "significant" differences in survival between the groups.
As an example, here is a mock-up using the lung dataset in the survival package from R. So the following code is similar enough to what I want to do, but much simplified in terms of the predictor set (which is why I want to simplify the code, so I don't make inconsistent calls across models).
library(survival)
# Step 1: Make a survival object with time-to-event and censoring indicator.
# Following works with defaults as status = 2 = dead in this dataset.
# Create survival object
lung.Surv <- with(lung, Surv(time=time, event=status))
# Step 2: Fit survival curves to object based on patient sex, plot this.
lung.survfit <- survfit(lung.Surv ~ lung$sex)
print(lung.survfit)
plot(lung.survfit)
# Step 3: Calculate log-rank test for difference in survival objects
lung.survdiff <- survdiff(lung.Surv ~ lung$sex)
print(lung.survdiff)
Now this is all fine and dandy, and I can live with this but would like to do better.
So my question is around step 3. What I would like to do here is to be able to use information in the formula from the lung.survfit object to feed into the calculation of the differences in survival curves: i.e. in the call to survdiff. And this is where my domitable [sic] programming skills hit a wall. Below is my current attempt to do this: I'd appreciate any help that you can give! Once I can get this sorted out I should be able to wrap a solution up in a function.
lung.survdiff <- survdiff(parse(text=(lung.survfit$call$formula)))
## Which returns following:
# Error in survdiff(parse(text = (lung.survfit$call$formula))) :
# The 'formula' argument is not a formula
As I commented above, I actually sorted out the answer to this shortly after having written this question.
So step 3 above could be replaced by:
lung.survdiff <- survdiff(formula(lung.survfit$call$formula))
But as Ben Barnes points out in the comment to the question, the formula from the survfit object can be more directly extracted with
lung.survdiff <- survdiff(formula(lung.survfit))
Which is exactly what I wanted and hoped would be available -- thanks Ben!
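Wrapping this up into the function the asker mentioned might look like the sketch below. The name surv_logrank is my own, not from the survival package; passing data explicitly keeps it working for models fitted with a data argument:

```r
library(survival)

# Hypothetical helper: rerun a survfit's formula through survdiff
surv_logrank <- function(fit, data) {
  survdiff(formula(fit), data = data)
}

lung.survfit <- survfit(Surv(time, status) ~ sex, data = lung)
lung.survdiff <- surv_logrank(lung.survfit, data = lung)
print(lung.survdiff)
```

This keeps the survfit and survdiff calls guaranteed to use the same formula, which was the point of the question.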