Displaying hierarchical clusters at cluster level (without cases) - r

I am interested in visualizing the results of a hierarchical cluster analysis. Is it possible to use a dendrogram to display the names or labels of clusters (and subclusters) without displaying the original cases that went into the cluster analysis?
For example, this code applies a hierarchical cluster analysis to the mtcars dataset.
library(factoextra)  # provides get_dist()
data("mtcars")
clust <- hclust(get_dist(mtcars, method = "pearson"), method = "complete")
plot(clust)
Let's say I cut the tree at 4 clusters and rename the clusters "sedan", "truck", "sportscar", and "van" (totally arbitrary labels).
clust1 <- cutree(clust, 4)
clust1 <- dplyr::recode(clust1,
                        '1' = 'sedan',
                        '2' = 'truck',
                        '3' = 'sportscar',
                        '4' = 'van')
Is it possible to display a dendrogram that shows these four labels as the nodes at the bottom of the tree, suppressing the original car names?
I am also interested in displaying subclusters within clusters in a similar way, but that may be outside the scope of this question. Bonus points if you can also give a suggestion for how to display subclusters within clusters in a dendrogram while suppressing the names of the original cases! :)
Thank you in advance!

Yes, you can do this. I am not familiar with your get_dist, so I will illustrate using the ordinary distance function dist.
data("mtcars")
clust <- hclust(dist(mtcars), method = "complete")
To cut off and display just the top of the tree, convert it to a dendrogram and take the upper component returned by cut. But you need to know what height to cut it at. The merge heights are stored in clust$height.
tail(clust$height)
[1] 113.3023 134.8119 141.7044 214.9367 261.8499 425.3447
Since you want four branches, you can cut at any height between the fourth and third heights from the end (141.7044 and 214.9367). I will use 213.
MTC_Dend = as.dendrogram(clust)
TreeTop = cut(MTC_Dend, h = 213)$upper
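If you would rather not read the cut height off by eye, it can be derived from clust$height. The following is a minimal sketch, assuming the same dist/complete-linkage tree as above and at least two branches; the names k, heights and cut_h are illustrative and not part of the original answer.

```r
# Rebuild the tree from the answer
clust <- hclust(dist(mtcars), method = "complete")

k <- 4  # desired number of branches (illustrative, must be >= 2)

# Merge heights, largest first. Cutting anywhere strictly between the
# (k-1)-th and k-th largest heights leaves exactly k branches.
heights <- sort(clust$height, decreasing = TRUE)
cut_h <- mean(heights[c(k - 1, k)])

TreeTop <- cut(as.dendrogram(clust), h = cut_h)$upper
length(labels(TreeTop))  # number of leaves in the truncated tree
```

Taking the midpoint is an arbitrary choice; any value in that interval should give the same truncated tree.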
You can get the basic plot now with plot(TreeTop), but it won't have the labels that you want. To change the labels, use the package dendextend which offers a tool specifically to change the labels.
library("dendextend")
labels(TreeTop) = c('sedan','truck', 'sportscar', 'van')
plot(TreeTop)
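Because the labels are assigned to the leaves from left to right, they need not line up with the cluster numbers returned by cutree. Here is a hedged sketch of one way to match the two using base R only. The names cluster_names, ids and the counter i are illustrative, and the cut height 213 is the one used above.

```r
clust <- hclust(dist(mtcars), method = "complete")
TreeTop <- cut(as.dendrogram(clust), h = 213)$upper

# Names indexed by cutree's cluster number (arbitrary, as in the question)
cluster_names <- c("sedan", "truck", "sportscar", "van")

# Cluster number of each original case in left-to-right plotting order,
# collapsed to one entry per branch
ids <- unique(cutree(clust, k = 4)[clust$order])

# Relabel the leaves of the truncated tree in that order
i <- 0
TreeTop <- dendrapply(TreeTop, function(node) {
  if (is.leaf(node)) {
    i <<- i + 1
    attr(node, "label") <- cluster_names[ids[i]]
  }
  node
})
plot(TreeTop)
```

Cutting at a lower height (more branches) should display subclusters in the same way, provided you supply one name per branch.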

Related

PCA Biplot Make Readable

I am working with the California Housing dataset, which has 20,640 observations and 10 attributes. I am using R to make a biplot, but the figure I obtained is not very readable. The output is as follows: [image]
I am using a simple code to make this output.
biplot(housingpr,scale = 0)
Is there any way to make this biplot more readable?
There is not much you can do if you want to plot 20,640 observations except to make the points smaller. Here is an example with the iris data:
data(iris)
iris.pca <- prcomp(iris[, -5], scale.=TRUE)
biplot(iris.pca, xlabs=rep("*", nrow(iris)), cex=.75)
The xlabs= argument sets the text drawn for each observation; by default these are the row names (or the row numbers if there are none). Here each label is replaced with an asterisk. If there are still too many points, you can replace the asterisk with a period. The cex= argument controls the size of the labels, with the default of 1 being full size.
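If the biplot is still too crowded, another option is to draw the scores yourself with small, semi-transparent points. This is a rough sketch on the iris data rather than the housing data; the transparency level and the arrow scaling factor (arrow_scale) are arbitrary choices for illustration.

```r
iris.pca <- prcomp(iris[, -5], scale. = TRUE)
scores <- iris.pca$x[, 1:2]          # observations on PC1 and PC2
var_dirs <- iris.pca$rotation[, 1:2] # variable directions (loadings)

# Small, semi-transparent points keep dense regions legible
plot(scores, pch = 16, cex = 0.5,
     col = adjustcolor("black", alpha.f = 0.3))

# Overlay the variables as arrows from the origin
arrow_scale <- 2  # arbitrary stretch so the arrows are visible
arrows(0, 0, arrow_scale * var_dirs[, 1], arrow_scale * var_dirs[, 2],
       length = 0.08, col = "red")
text(arrow_scale * var_dirs, labels = rownames(var_dirs), col = "red", pos = 3)
```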

Different visualization for hierarchical clustering of dendrogram

I would like to have visualization of hierarchical clustering with shapes one inside the other. Brightness level represents level of hierarchy.
Let me show you my idea with an example:
# Clustering small proportion of iris data
clusters <- hclust(dist(iris[20:28, 3:4]), method = 'average')
# Visualizing the result as a dendrogram
plot(clusters)
Now we can convert the dendrogram as below.
Is there any R package that can produce something similar?
This is only a partial answer. You can use clusplot from the cluster package to get some way in that direction. You could probably improve on this by changing the source of clusplot (type getAnywhere(clusplot.default) to get the source). But it is probably some work to get your bubbles to not overlap. Anyway, here's the plot you get from clusplot. It may also be of interest to look at the individual plots one at a time instead of showing them all together.
# use sample data
df <- iris[20:28, 3:4]
# calculate hierarchical clustering
hfit <- hclust(dist(df), method = 'average')
# plot dendrogram
plot(hfit)
# use clusplot at all possible cutoffs and show on top of each other.
library(cluster)
clusplot(df, cutree(hfit, 1), lines = 0)
for (i in 2:nrow(df)) {
  clusplot(df, cutree(hfit, i), lines = 0, add = TRUE)
}
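To check which points each nested shape would enclose, the cluster memberships at every cutoff can be tabulated in one call. A small sketch using the same data (the name memb is illustrative):

```r
df <- iris[20:28, 3:4]
hfit <- hclust(dist(df), method = "average")

# One column per number of clusters (1 to 9), one row per observation.
# Reading across a row shows how that point's cluster splits as k grows.
memb <- cutree(hfit, k = 1:nrow(df))
memb
```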

Different color for different cluster in a tree using adegenet R package

I am using the R package adegenet to plot the neighbor-joining tree.
In my file I have 20,000 columns and 500 rows. Rows correspond to individuals. My first column is Population ID and second column is Individual ID. Columns contain values 0,1 & 2. I am able to plot a tree in one color, but depending upon the population I want every cluster to be a different color.
This is what I did. If "dat" is my data file, then:
D<-dist(as.matrix(dat))
tre<-nj(D)
plot(tre, type = "unr", show.tip.lab = TRUE, cex=0.3, font=1, edge.col="Blue")
If I try edge.col=c("red","green","blue") I run into the following error:
Error in if (use.edge.length) unrooted.xy(Ntip, Nnode, z$edge, z$edge.length, :
argument is not interpretable as logical
I'd appreciate any help!
Your example should be reproducible, so that it is easier to help and to reproduce your problem. See this post for more details. I'm trying with iris and it works like a charm. By the way, I think adegenet is not required here: the plot is actually drawn by plot.phylo from the package ape, and all other functions are either built-in or from ape.
Documentation (?plot.phylo) says:
edge.col a vector of mode character giving the colours used to draw the branches of the plotted phylogeny. These are taken to be in the same order than the component edge of phy. If fewer colours are given than the length of edge, then the colours are recycled.
ape preserves the order of the rows, and you can use a factor to index your vector of colors, so a reproducible example using iris could be:
library(ape)
D <-dist(as.matrix(iris[, 1:4]))
tree <- nj(D)
plot(tree, type = "unr", show.tip.lab = TRUE, cex=0.3, font=1,
     edge.col=c("red","green","blue")[iris$Species])
Is that what you want?
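One caveat, going by the documentation quoted above: edge.col is matched to the rows of the tree's edge matrix, not to the tips, and an unrooted tree has more edges than tips, so a per-tip colour vector gets recycled across edges. A hedged sketch of one way to colour each terminal branch by its tip's group (tip_cols, edge_cols and is_terminal are illustrative names, and it assumes the tips keep the row order of the input, as stated above):

```r
library(ape)

D <- dist(as.matrix(iris[, 1:4]))
tree <- nj(D)

# One colour per tip, in the row order of the input data
tip_cols <- c("red", "green", "blue")[iris$Species]

# Start with a neutral colour for every edge, then colour each terminal
# edge (an edge whose child node is a tip) with that tip's colour
edge_cols <- rep("grey50", nrow(tree$edge))
is_terminal <- tree$edge[, 2] <= Ntip(tree)
edge_cols[is_terminal] <- tip_cols[tree$edge[is_terminal, 2]]

plot(tree, type = "unrooted", show.tip.label = FALSE, edge.color = edge_cols)
```

Internal branches stay grey here; colouring whole clades would need an extra step to decide which internal edges belong to a single group.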

PCA biplot one variables shown R

I ran a PCA on a set of 45,000 genes from 5 different samples, and when I draw a biplot, all I see is a mass of text (corresponding to the observation names), and I cannot see the location of my samples. Is there a way to plot the location of the samples only, and not the observations, in a biplot?
Using built in data from R
usa <- USArrests
pca1 <- prcomp(usa)
biplot(pca1)
This generates a biplot where all the states (observation names) overlap the variables (my different samples) rape, etc. Is it possible to plot only the variables (samples), and not the states (observation names)?
biplot.default uses text to write the categorical variable name of the observation. As it doesn't use points you need to modify the source if you only want the points (and not the labels) to be plotted.
However, you could "hack" it by doing something like:
biplot(pca1, xlabs = rep(".", nrow(usa)))
I hope this is what you're looking for!
Edit: If this is not satisfactory, you can modify the source shown when running stats:::biplot.default so that it uses points.
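If only the variables are of interest, a possible alternative to editing the source is to draw the loadings directly from the prcomp result. This is a minimal sketch, not the biplot function itself: it skips the scaling that biplot applies, and the names vars and lim are illustrative.

```r
pca1 <- prcomp(USArrests)
vars <- pca1$rotation[, 1:2]  # loadings of each variable on PC1 and PC2

# Symmetric limits so the origin sits in the middle of the plot
lim <- c(-1, 1) * max(abs(vars)) * 1.2

plot(vars, type = "n", xlim = lim, ylim = lim, xlab = "PC1", ylab = "PC2")
arrows(0, 0, vars[, 1], vars[, 2], length = 0.1)
text(vars, labels = rownames(vars), pos = 3)
```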

Clustering and heatmap in R

I am a newbie to R and I am trying to do some clustering on a data table where rows represent individual objects and columns represent the features that have been measured for these objects. I've worked through some clustering tutorials and I do get some output. However, the heatmap that I get after clustering does not correspond at all to the heatmap produced from the same data table with another programme. While the heatmap of that programme indicates clear differences in marker expression between the objects, my heatmap doesn't show many differences and I cannot recognize any clustering (i.e., colour) pattern in it. It just looks like a randomly jumbled set of colours that are close to each other (no big contrast). Here is an example of the code I am using; maybe someone has an idea of what I might be doing wrong.
mydata <- read.table("mydata.csv")
datamat <- as.matrix(mydata)
datalog <- log(datamat)
I am using log values for the clustering because I know that the other programme does so, too.
library(gplots)
hr <- hclust(as.dist(1-cor(t(datalog), method="pearson")), method="complete")
mycl <- cutree(hr, k=7)
mycol <- sample(rainbow(256)); mycol <- mycol[as.vector(mycl)]
heatmap(datamat, Rowv=as.dendrogram(hr), Colv=NA,
        col=colorpanel(40, "black","yellow","green"),
        scale="column", RowSideColors=mycol)
Again, I plot the original values but use the log clusters, because I know that this is what the other programme does.
I tried to play around with the methods, but I don't get anything that looks even remotely like a clustered heatmap. When I take out the scaling, the heatmap becomes extremely dark (and I am actually quite sure that I somehow have to scale or normalize the data by column). I also tried to cluster with k-means, but again, this didn't help. My idea was that the colour scale might not be used completely because of two outliers, but although removing them slightly increased the range of colours plotted on the heatmap, this still did not reveal proper clusters.
Is there anything else I could play around with?
And is it possible to change the colour scale with heatmap so that outliers fall into a last bin that covers "everything greater than a particular value"? I tried to do this with heatmap.2 (argument "breaks"), but I didn't quite succeed, and I also didn't manage to add the row side colours that I use with the heatmap function.
If you are okay with using heatmap.2 from the gplots package, it allows you to supply breaks that assign colors to ranges of values in your heatmap.
For example if you had 3 colors blue, white, and red with the values going from low to high you could do something like this:
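library(gplots)  # provides heatmap.2 and bluered
# 17 break points define the 16 bins coloured by bluered(16); mtscaled is a scaled numeric matrix
my.breaks <- c(seq(-5, -.6, length.out = 6), seq(-.5999999, .1, length.out = 4), seq(.100009, 5, length.out = 7))
result <- heatmap.2(mtscaled, Rowv = TRUE, scale = "none", dendrogram = "row", symm = TRUE, col = bluered(16), breaks = my.breaks)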
In this case you have 3 sets of values that correspond to the 3 colors. The break values will of course depend on the values in your data.
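For the "everything greater than a particular value" bin asked about in the question, one option is to make the outermost breaks wide enough to swallow any outlier. The sketch below uses base heatmap, which passes extra arguments such as breaks on to image. The data are made up, and the cutoffs of -2 and 2 and the names n_col, cols and inner are arbitrary illustrations.

```r
set.seed(1)
x <- matrix(rnorm(200), nrow = 20)  # made-up data
x[1, 1] <- 25                       # one extreme outlier
z <- scale(x)                       # column-wise scaling

n_col <- 16
cols <- colorRampPalette(c("blue", "white", "red"))(n_col)

# n_col - 1 evenly spaced inner break points between the cutoffs,
# plus one catch-all bin at each end, gives n_col + 1 breaks in total
inner <- seq(-2, 2, length.out = n_col - 1)
breaks <- c(min(z, -2) - 1, inner, max(z, 2) + 1)

heatmap(z, scale = "none", col = cols, breaks = breaks)
```

With breaks built this way, the outlier falls into the last bin instead of stretching the whole colour scale.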
One thing you are doing in your program is calling hclust on your data and then calling heatmap on it. However, the heatmap manual page says of its hclustfun argument:
Defaults to hclust.
So I don't think you need to do that. You might want to take a look at some similar questions that I had asked that might help to point you in the right direction:
Heatmap Question 1
Heatmap Question 2
If you post an image of the heatmap you get and an image of the heatmap that the other program is making, it will be easier for us to help you out more.
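As a hedged illustration of the point about heatmap's built-in clustering, the correlation distance and linkage from the question can be handed to heatmap through its distfun and hclustfun arguments instead of calling hclust beforehand. The data here are made up. This applies only when the matrix being clustered is the one being plotted; clustering on log values while plotting the raw values still needs a precomputed dendrogram passed via Rowv, as in the question.

```r
set.seed(42)
x <- matrix(rexp(120), nrow = 12,
            dimnames = list(paste0("obj", 1:12), paste0("marker", 1:10)))
xlog <- log(x)  # made-up positive data, log-transformed

# Let heatmap do the row clustering with a Pearson correlation distance
# and complete linkage; columns are left in their original order
res <- heatmap(xlog,
               distfun = function(m) as.dist(1 - cor(t(m), method = "pearson")),
               hclustfun = function(d) hclust(d, method = "complete"),
               Colv = NA, scale = "column")

res$rowInd  # row order used in the plot
```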

Resources