I would like to use the cutree() function to cluster a phylogenetic tree into a specified number of clades. However, the phylo object (an unrooted phylogenetic tree) is not ultrametric, so as.hclust.phylo() returns an error. The goal is to sub-sample tips of a tree while retaining maximum diversity, hence the desire to cluster by a specified number of clades (and then randomly sample one tip from each clade). This will be done for a number of trees with varying numbers of desired samples. Any help in coercing the unrooted tree into an hclust object, or a suggestion for a different method of systematically collapsing the trees (phylo objects) into a predefined number of clades, would be greatly appreciated.
library("ape")
library("ade4")
tree <- rtree(n = 32)
tree.hclust <- as.hclust.phylo(tree)
Returns:
"Error in as.hclust.phylo(tree) : the tree is not ultrametric"
If I make a distance matrix of the branch lengths between all tips, I am able to use hclust to generate clusters and subsequently cutree into the desired number of clusters:
dm <- cophenetic.phylo(tree)
single <- hclust(as.dist(dm), method="single")
cutSingle <- cutree(single, k = 10)  # named integer vector: tip label -> cluster id
color <- cutSingle[tree$tip.label]   # reorder to match the tip order of the tree
plot.phylo(tree, tip.color=color)
However, the results are not desirable because very basal branches get clustered together. Basing the clustering on the tree structure, or the tip to root distance would be more desirable.
Any suggestions are appreciated!
I don't know if it's what you want, but first you have to make the tree ultrametric with chronos().
Here's an answer that could help you out:
How to convert a tree to a dendrogram in R?
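A minimal sketch of the chronos() route, assuming the tree is rooted and binary (as.hclust.phylo() requires both, so an unrooted tree would first need rooting). chronos() rescales the branch lengths to make the tree ultrametric, so the clades come from the dated tree rather than from the original branch lengths. The seed, k = 10 and the sampling step are illustrative choices, not part of the original answer.

```r
library(ape)

set.seed(1)
tree <- rtree(n = 32)                      # random rooted, binary tree, as in the question

# Make the tree ultrametric (chronos with its default settings)
chrono <- chronos(tree, quiet = TRUE)

hc <- as.hclust.phylo(chrono)              # accepted now that the tree is ultrametric
clades <- cutree(hc, k = 10)               # named vector: tip label -> clade id

# Randomly keep one tip per clade
picked <- vapply(split(names(clades), clades),
                 function(tips) sample(tips, 1),
                 character(1))
subtree <- keep.tip(tree, picked)          # prune the original tree to the sampled tips
```

Because the cut is made on the ultrametric tree, each cluster is a clade of that tree, which avoids the problem of basal branches being lumped together by single-linkage clustering on raw distances.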
Somewhat related to this question.
I am using the R subspace package for subspace clustering. As in the question above, I have failed to use the generic plotting method to plot my resulting clusters in a way native to the package. The next step is to understand the output of the command
CLIQUE(df, xi = 40, tau = 0.2)
That looks something like this:
I understand that the "object" is the row number for the clustered unit, and the subspace indicates the dimensions of the data in which the clustering was done. However I don't see how the clusters in the given dimensions can be distinguished.
The documentation does not contain information on the output. Ideally, my goal is to plot out all the clusters with something like ggplot2, or in 3D, what have you. And I need to know which units are in which clusters in corresponding dimensions.
Additionally, I checked whether any two members of the output list have the same dimensions, like this:
cluster_result <- clique_model
n <- length(cluster_result)
equalities_matrix <- matrix(0L, nrow = n, ncol = n)
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    # 1 when members i and j have the same subspace (always 1 when i == j)
    equalities_matrix[i, j] <- all(
      cluster_result[[i]]$subspace == cluster_result[[j]]$subspace
    )
  }
}
sum(equalities_matrix)
The answer is no.
So, here is what my research led to; it might be helpful for someone in the future.
Above, the CLIQUE algorithm output one cluster per set of dimensions, possibly by virtue of the data or the tuning of the algorithm. I added more features to the data, ran it again, and checked again whether any two members of the output list have the same dimensions. This time, yes: several sets of dimensions were the same, as sum(equalities_matrix) yielded a number larger than the number of features.
In conclusion, the output of the algorithm is a list of lists where each member list represents one cluster in one subspace:
subspace ... the dimensions making the subspace indicated with TRUE,
objects ... members of the cluster.
If there is more than one cluster in a given subspace, there will be more member lists with the same subspace, and different members of the cluster.
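To get from this structure to "which unit is in which cluster, in which dimensions", the list can be flattened into a long table. The sketch below does not call CLIQUE(); clique_model is a hand-made stand-in with the two fields described above (subspace as a logical vector, objects as row numbers), so the exact classes returned by the package are an assumption.

```r
# Hypothetical stand-in for the CLIQUE() result: one list element per cluster
clique_model <- list(
  list(subspace = c(TRUE, FALSE, TRUE),  objects = c(1L, 2L, 5L)),
  list(subspace = c(TRUE, FALSE, TRUE),  objects = c(3L, 4L)),
  list(subspace = c(FALSE, TRUE, FALSE), objects = c(1L, 3L))
)

# One row per (cluster, object); the subspace is encoded as e.g. "1,3"
membership <- do.call(rbind, lapply(seq_along(clique_model), function(i) {
  cl <- clique_model[[i]]
  data.frame(
    cluster  = i,
    subspace = paste(which(cl$subspace), collapse = ","),
    object   = cl$objects
  )
}))

# Number of clusters found in each subspace
clusters_per_subspace <- table(unique(membership[c("cluster", "subspace")])$subspace)
```

The membership table can then be joined to the original data by row number and plotted, for example with one panel per subspace.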
Here are the papers that helped me understand the theory:
Parsons, Haque, and Liu (2004)
Agrawal, Gehrke, Gunopulos, and Raghavan
I'm trying to conduct a hierarchical agglomerative cluster analysis in R using the WeightedCluster package. Before doing so, I calculated the distances between state sequences with the TraMineR package (see pp. 4-6 here).
Following the vignette hyperlinked above, I fed my distance matrix into hclust while adding a vector of weights as follows (datadist is the distance matrix; dataframe is my data frame featuring time series data; and weight is an all-waves longitudinal survey weight):
Cluster <- hclust(as.dist(datadist), method = "ward", members = dataframe$weight)
Then, after arriving at a specific cluster solution (four subgroups), I used the cutree function to determine the relative frequency of each cluster and assign cases:
subgroups <- cutree(Cluster, k = 4)
However, I somehow generated more than four groups after executing the code above (over 30, in fact). When I removed the vector of weights, I was able to produce frequencies for four clusters, but unweighted results are sub-optimal.
If anyone out there can help me understand what's going on (and how I can address or treat the problem), it would be greatly appreciated.
I have a large matrix of 500K observations to cluster using hierarchical clustering. Due to the large size, I do not have the computing power to calculate the distance matrix.
To overcome this problem I chose to aggregate my matrix, merging the observations that were identical, which reduced it to about 10K observations. I have the frequency for each of the rows in this aggregated matrix. I now need to incorporate this frequency as a weight in my hierarchical clustering.
The data are a mixture of numerical and categorical variables for the 500K observations, so I have used the daisy function (cluster package) to calculate the Gower dissimilarity for my aggregated dataset. I want to use hclust from the stats package on the aggregated dataset, but I want to take into account the frequency of each observation. From the help page for hclust, the usage is as follows:
hclust(d, method = "complete", members = NULL)
The help text for the members argument reads: "NULL or a vector with length size of d. See the ‘Details’ section." The Details section says: "If members != NULL, then d is taken to be a dissimilarity matrix between clusters instead of dissimilarities between singletons and members gives the number of observations per cluster. This way the hierarchical cluster algorithm can be ‘started in the middle of the dendrogram’, e.g., in order to reconstruct the part of the tree above a cut (see examples). Dissimilarities between clusters can be efficiently computed (i.e., without hclust itself) only for a limited number of distance/linkage combinations, the simplest one being squared Euclidean distance and centroid linkage. In this case the dissimilarities between the clusters are the squared Euclidean distances between cluster means."
From the above description, I am unsure whether I can assign my frequency weights to the members argument, as it is not clear if this is the purpose of this argument. I would like to use it like this:
hclust(d, method = "complete", members = df$freq)
Where df$freq is the frequency of each row in the aggregated matrix. So if a row is duplicated 10 times this value would be 10.
If anyone can help me that would be great,
Thanks
Yes, this should work fine for most linkages, in particular single, group average and complete linkage. For Ward etc. you need to take the weights into account correctly yourself.
But even that part is not hard. Just make sure to use the cluster sizes, because you need to pass the distance between two clusters, not between two points. So the matrix should contain the distance between n1 points at location x and n2 points at location y. For min/max/mean (single, complete and average linkage) this n disappears or cancels out. For Ward, you should get an SSQ-like (sum-of-squares) formula.
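A small check of this claim for average linkage, using made-up numeric data and Euclidean distance rather than Gower dissimilarities (the data, seed and counts are illustrative assumptions). Clustering the de-duplicated rows with members = freq should reproduce every non-zero merge height of the dendrogram built from the full, duplicated data.

```r
set.seed(42)
uniq <- matrix(rnorm(12), ncol = 2)        # 6 distinct observations
freq <- c(3, 1, 4, 2, 1, 5)                # how often each row occurs in the full data

# Full data: every row repeated according to its frequency
full <- uniq[rep(seq_len(nrow(uniq)), times = freq), ]

hc_full <- hclust(dist(full), method = "average")
hc_agg  <- hclust(dist(uniq), method = "average", members = freq)

# Duplicates merge at height 0 in the full tree; all later merges should match
h_full <- sort(hc_full$height[hc_full$height > 1e-12])
h_agg  <- sort(hc_agg$height)
same_heights <- isTRUE(all.equal(h_full, h_agg))
```

The same comparison can be repeated with method = "single" or method = "complete", where the cluster sizes do not enter the linkage at all.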
I have a problem finding the number of clusters after using cutree on a dendrogram. Here is my approach:
# mat is a huge matrix
hc <- hclust(as.dist(mat), method = "average", members = NULL)
# cut the tree just one level below the maximum height
tree <- cutree(hc, h = hc$height[[length(hc$height) - 1]])
By printing the tree variable I can see that my dendrogram is cut into two clusters. I can also get the labels from each cluster using names(tree[tree==1]), but how can I get the number of clusters without looking at the data? I want to automate this in a pipeline based on the number of clusters in the tree variable.
I finally managed to answer my own question by looping over the tree object after cutting the dendrogram, but this might not be the optimal solution. Feel free to suggest modifications to make it more elegant.
clust <- c()
for (i in seq_along(tree)) {
  clust[i] <- tree[[i]]
}
length(unique(clust))
As far as I know, this should give the answer.
Thank you
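Since cutree() returns an integer vector of cluster ids (one per observation, numbered from 1), the loop can be collapsed into a single call. A minimal sketch on toy data; mat here is a small random stand-in for the real matrix.

```r
set.seed(1)
mat <- as.matrix(dist(matrix(rnorm(40), ncol = 2)))   # toy 20 x 20 distance matrix

hc <- hclust(as.dist(mat), method = "average")
# Cut at the second-largest merge height, i.e. just below the final merge
tree <- cutree(hc, h = hc$height[length(hc$height) - 1])

n_clusters <- length(unique(tree))   # same value as max(tree)
sizes <- table(tree)                 # number of observations per cluster
```

In a pipeline, n_clusters can be used directly, and names(sizes) gives the cluster ids to iterate over.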
I have a phylogenetic tree file in Newick format in which a few branch lengths are very small (approx. 1.042e-06), and I need to "eliminate" these small distances.
I thought of multiplying all distances by 10, because this multiplication has no effect on what I need the tree for later.
To do that I found the ape package in R and its function compute.brlen, since with this function you can change the branch lengths by means of a function.
Any idea how to multiply the branch lengths by 10 with this function?
I have tried compute.brlen(tree, main=expression(rho==10)), but I think this is not correct for what I want.
Try this:
library(ape)  # load the ape package
mytree <- rtree(10)  # a random tree; with real data use read.tree(path_to_tree_file) instead
mytree$edge.length <- mytree$edge.length * 10  # or any other scalar that you want
Keep in mind that this will scale all the branch lengths in the phylogeny.
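For a tree stored in a Newick file, the whole round trip might look like the sketch below. The Newick string and the factor of 10 are placeholders; with a real file, read.tree(file = ...) and write.tree(..., file = ...) would take the place of the text-based calls. compute.brlen() is not needed here: it computes new branch lengths (Grafen's method by default) instead of rescaling the existing ones.

```r
library(ape)

# Placeholder tree with very short branches
newick <- "((A:1.042e-06,B:2e-06):3e-06,C:4e-06);"
tree <- read.tree(text = newick)

scaled <- tree
scaled$edge.length <- tree$edge.length * 10   # multiply every branch length by 10

# Serialise back to a Newick string; pass file = "..." to write to disk instead
scaled_newick <- write.tree(scaled)
```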