I'm using the pvclust package in R to perform bootstrapped hierarchical clustering. The output is then plotted as a hclust object with a few extra features (different default title, p-values at nodes). I've attached a link to one of the plots here.
This plot is exactly what I want, except that I need the leaf labels to be displayed horizontally instead of vertically. As far as I can tell there isn't an option for rotating the leaf labels in plot.hclust. I can plot the hclust object as a dendrogram
(i.e. plot(as.dendrogram(example$hclust), leaflab="textlike") instead of plot(example))
but the leaf labels are then printed in boxes that I can't seem to remove, and the heights of the nodes in the hclust object are lost. I've attached a link to the dendrogram plot here.
What would be the best way to make a plot that is as similar as possible to the standard plot.pvclust() output, but with horizontal leaf labels?
One way to get the text the way you want is to have plot.dendrogram print nothing and just add the labels yourself. Since you don't provide your data, I illustrate with some built-in data. By default, the plot was not leaving enough room for the labels, so I set the ylim to allow the extra needed room.
set.seed(1234)
HC = hclust(dist(iris[sample(150,6),1:4]))
plot(as.dendrogram(HC), leaflab="none", ylim=c(-0.2, max(HC$height)))
text(x=seq_along(HC$labels), y=-0.2, labels=HC$labels[HC$order])
I've written a function that plots the standard pvclust plot with empty strings as leaf labels, then plots the leaf labels separately.
plot.pvclust2 <- function(clust, x_adj_val, y_adj_val, ...){
  # Save the labels of the hclust object in clust_labels,
  # then replace clust$hclust$labels with empty strings.
  # The pvclust object will be plotted as usual, but without
  # any leaf labels.
  clust_labels <- clust$hclust$labels
  clust$hclust$labels <- rep("", length(clust_labels))
  clust_merge <- clust$hclust$merge  # For shorter commands

  # Create empty vector for the y_heights and populate with height vals
  y_heights <- numeric(length = length(clust_labels))
  for(i in seq_len(nrow(clust_merge))){
    # For the i-th merge: negative entries in merge are singletons
    # (original observations), positive entries are clusters
    # formed in earlier merges.
    singletons <- clust_merge[i,] < 0
    y_index <- - clust_merge[i, singletons]
    y_heights[y_index] <- clust$hclust$height[i] - y_adj_val
  }

  # Horizontal text can be cut off by the margins, so x_adjust moves labels
  # on the left of a cluster to the right, and labels on the right of a
  # cluster to the left
  x_adjust <- numeric(length = length(clust_labels))
  # Values in column 1 of clust_merge are on the left of a cluster, column 2
  # holds the right-hand values
  x_adjust[-clust_merge[clust_merge[ ,1] < 0, 1]] <- 1 * x_adj_val
  x_adjust[-clust_merge[clust_merge[ ,2] < 0, 2]] <- -1 * x_adj_val

  # Plot the pvclust object with empty labels, then plot horizontal labels
  plot(clust, ...)
  text(x = seq(1, length(clust_labels)) +
         x_adjust[clust$hclust$order],
       y = y_heights[clust$hclust$order],
       labels = clust_labels[clust$hclust$order])
}
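For a plain hclust object (no pvclust p-values) the same idea can be sketched with base R only. This is a minimal sketch, not the pvclust code path: leaf_merge_height is a hypothetical helper name, and the 0.15 offset is an arbitrary value chosen so the text sits below the default hang of plot.hclust.

```r
# Height at which each leaf is first merged: row i of hc$merge joins at
# hc$height[i], and negative entries in that row are single observations.
leaf_merge_height <- function(hc) {
  h <- numeric(length(hc$order))
  for (i in seq_len(nrow(hc$merge))) {
    leaves <- -hc$merge[i, hc$merge[i, ] < 0]
    h[leaves] <- hc$height[i]
  }
  h
}

set.seed(1234)
hc <- hclust(dist(iris[sample(150, 6), 1:4]))
h  <- leaf_merge_height(hc)

# Draw the tree without labels, then add horizontal labels in leaf order
plot(hc, labels = FALSE)
text(x = seq_along(hc$order),
     y = h[hc$order] - 0.15 * max(hc$height),  # arbitrary offset, tune by eye
     labels = hc$labels[hc$order],
     xpd = NA)                                  # allow text outside the plot region
```

Indexing both the heights and the labels by hc$order is what keeps each label under its own leaf.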
I need to make a histogram of my variable, 'travel time'. Inside each histogram I need to plot the regression (correlation) data, i.e. my observed data vs. predicted, and I need to repeat this for different times of the day and week (in other words, make a matrix of such figures using the par function). So far I can draw the histograms and arrange them in matrix form, but I am stuck on the inset plot: plotting the x and y data together with the y = x line, and placing each inset inside its corresponding histogram within the matrix. How can I do that, as in the figure below? Any help would be appreciated. Thanks!
One way to do this is to loop over your data and create the desired plot on every iteration. Here is a not very polished example, but it shows the logic of drawing a small plot over a larger one. You will have to tweak the code to get it to work the way you need, but that shouldn't be difficult.
# create some sample dataset (your x values)
a <- c(rnorm(100,0,1))
b <- c(rnorm(100,2,1))
# create their "y" values counterparts
x <- a + 3
y <- b + 4
# bind the data into two matrices (explanatory variables in one, explained in the other)
data1 <- cbind(a,b)
data2 <- cbind(x,y)
# set dimensions of the plot matrix
par(mfrow = c(2,1))
# for each of the explanatory - explained pair
for (i in 1:ncol(data2))
{
  # set positioning of the histogram
  par("plt" = c(0.1,0.95,0.15,0.9))
  # plot the histogram
  hist(data1[, i])
  # set positioning of the small plot
  par("plt" = c(0.7, 0.95, 0.7, 0.95))
  # plot the small plot over the histogram
  par(new = TRUE)
  plot(data1[, i], data2[, i])
  # add some line into the small plot
  lines(data1[, i], data1[, i])
}
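The same logic can be wrapped in a small function so that each panel gets its histogram, its inset scatter plot, and a y = x reference line drawn with abline. This is only a sketch under stated assumptions: hist_with_inset, the period names, and the simulated observed/predicted values are all made up for illustration.

```r
# Hypothetical helper: histogram of the observed values with an
# observed-vs-predicted scatter plot inset in the upper-right corner.
hist_with_inset <- function(observed, predicted, main = "") {
  par(plt = c(0.10, 0.95, 0.15, 0.90))             # region for the histogram
  hist(observed, main = main, xlab = "travel time")
  par(plt = c(0.65, 0.95, 0.60, 0.90), new = TRUE)  # region for the inset
  rng <- range(observed, predicted)                 # same scale on both axes
  plot(observed, predicted, xlim = rng, ylim = rng,
       xlab = "", ylab = "", cex = 0.5, cex.axis = 0.6)
  abline(a = 0, b = 1, col = "red")                 # the y = x line
  invisible(cor(observed, predicted))
}

set.seed(1)
periods <- c("AM peak", "Midday", "PM peak", "Night")  # made-up labels
op <- par(mfrow = c(2, 2))
cors <- sapply(periods, function(p) {
  obs  <- rnorm(100, mean = 30, sd = 5)   # simulated observed travel times
  pred <- obs + rnorm(100, sd = 2)        # simulated predictions
  hist_with_inset(obs, pred, main = p)
})
par(op)
```

Using a common range for both inset axes keeps the y = x line on the diagonal, which makes over- and under-prediction easy to see.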
Is there any way for me to add some points to a pairs plot?
For example, I can plot the Iris dataset with pairs(iris[1:4]), but I wanted to execute a clustering method (for example, kmeans) over this dataset and plot its resulting centroids on the plot I already had.
It would also help if there were a way to plot the whole data set and the centroids together in a single pairs plot, with the centroids drawn differently. The idea is to plot pairs(rbind(iris[1:4],centers)) (where centers holds the three centroids) but draw the last three rows of this matrix in a different way, for example by changing cex or pch. Is it possible?
You give the solution yourself in the last paragraph of your question. Yes, you can use pch and col in the pairs function.
pairs(rbind(iris[1:4], kmeans(iris[1:4],3)$centers),
      pch=rep(c(1,2), c(nrow(iris), 3)),
      col=rep(c(1,2), c(nrow(iris), 3)))
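A slightly more explicit variant of the same call, as a sketch: the clustering result is stored once, a logical flag marks which rows are centroids, and the observations are also coloured by cluster. The names all_pts and is_center and the particular symbols and colours are arbitrary choices for illustration.

```r
set.seed(42)                           # kmeans starts from random centers
cl <- kmeans(iris[1:4], centers = 3)

# Stack observations and centroids, and flag which rows are centroids
all_pts   <- rbind(iris[1:4], cl$centers)
is_center <- rep(c(FALSE, TRUE), times = c(nrow(iris), nrow(cl$centers)))

pairs(all_pts,
      pch = ifelse(is_center, 8, 1),                        # asterisk for centroids
      col = c(cl$cluster + 1, rep(1, nrow(cl$centers))))    # clusters in colour, centroids in black
```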
Another option is to use a panel function:
cl <- kmeans(iris[1:4],3)
# column pairs in the order in which pairs() draws the off-diagonal panels
idx <- subset(expand.grid(x=1:4,y=1:4),x!=y)
i <- 1
pairs(iris[1:4],bg=cl$cluster,pch=21,
      panel=function(x, y,bg, ...) {
        points(x, y, pch=21,bg=bg)
        points(cl$centers[,idx[i,'x']],cl$centers[,idx[i,'y']],
               cex=4,pch=10,col='blue')
        i <<- i +1   # advance the global panel counter
      })
But I think it is safer and easier to use the splom function from lattice. The legend is also generated automatically.
cl <- kmeans(iris[1:4],3)
library(lattice)
splom(iris[1:4],groups=cl$cluster,pch=21,
      panel=function(x, y,i,j,groups, ...) {
        panel.points(x, y, pch=21,col=groups)
        panel.points(cl$centers[,j],cl$centers[,i],
                     pch=10,col='blue')
      },auto.key=TRUE)
Imagine we have 7 categories (e.g. religion), and we would like to plot them not in a linear way, but in clusters that are automatically chosen to be nicely aligned. Here the individuals within groups have the same response, but should not be plotted on one line (which happens when plotting ordinal data).
So to sum it up:
- automatically using available graph space
- grouping without order, spread around the canvas
- individuals remain visible; no overlapping
- it would be nice to have the individuals within groups bounded by some (invisible) circle
Are there any packages designed for this purpose? What are keywords I need to look for?
Example data:
religion <- sample(1:7, 100, T)
# No overlap here, but I would like to see the group part come out more.
plot(religion)
After assigning coordinates to the center of each group,
you can use wordcloud::textplot to avoid overlapping labels.
# Data
n <- 100
k <- 7
religion <- sample(1:k, n, TRUE)
names(religion) <- outer(LETTERS, LETTERS, paste0)[1:n]
# Position of the groups
x <- runif(k)
y <- runif(k)
# Plot
library(wordcloud)
textplot(
x[religion], y[religion], names(religion),
xlim=c(0,1), ylim=c(0,1), axes=FALSE, xlab="", ylab=""
)
Alternatively, you can build a graph with a clique (or a tree)
for each group,
and use one of the many graph-layout algorithms in igraph.
library(igraph)
A <- outer( religion, religion, `==` )
g <- graph.adjacency(A)
plot(g)
plot(minimum.spanning.tree(g))
In the image you linked, each point has three numbers associated with it: the coordinates x and y, and the group (color). If you only have one piece of information for each individual, you can do something like this:
set.seed(1)
centers <- data.frame(religion=1:7, cx=runif(7), cy=runif(7))
eps <- 0.04
data <- within(merge(data.frame(religion=sample(1:7, 100, TRUE)), centers),
{
  x <- cx+rnorm(length(cx),sd=eps)
  y <- cy+rnorm(length(cy),sd=eps)
})
with(data, plot(x,y,col=religion, pch=16))
Note that I'm creating random centers for each group and also creating small displacements around these centers for each observation. You'll have to play around with the parameter eps, and maybe set the centers manually, if you want to pursue this path.
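If the individuals of a group should stay inside a circle, the normal displacements can be swapped for points drawn uniformly in a disc around each center. This is a sketch of that variation, not part of the answer above: radius and the other names are arbitrary, and nothing here prevents two points, or two circles, from overlapping.

```r
set.seed(1)
k <- 7
n <- 100
radius <- 0.06                                   # arbitrary circle size
religion <- sample(1:k, n, replace = TRUE)
centers  <- data.frame(cx = runif(k), cy = runif(k))  # random; may need manual placement

# Uniform point in a disc: distance radius * sqrt(u), angle uniform on [0, 2*pi)
r     <- radius * sqrt(runif(n))
theta <- runif(n, 0, 2 * pi)
x <- centers$cx[religion] + r * cos(theta)
y <- centers$cy[religion] + r * sin(theta)

plot(x, y, col = religion, pch = 16, asp = 1,
     axes = FALSE, xlab = "", ylab = "")
# Optional: draw the bounding circles in light grey
symbols(centers$cx, centers$cy, circles = rep(radius, k),
        inches = FALSE, add = TRUE, fg = "grey80")
```

Taking the square root of the uniform draw spreads the points evenly over the disc instead of piling them up near the center.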
In the kmeans analysis below I assign a 1 or 0 to indicate whether a word is associated with a user:
cells = c(1,1,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,0,1,1,1,1,1,1)
rnames = c("a1","a2","a3","a4","a5","a6","a7","a8","a9")
cnames = c("google","so","test")
x <- matrix(cells, nrow=9, ncol=3, byrow=TRUE, dimnames=list(rnames, cnames))
# run K-Means
km <- kmeans(x, 3, 15)
# print components of km
print(km)
# plot clusters
plot(x, col = km$cluster)
# plot centers
points(km$centers, col = 1:2, pch = 8)
This is the graph :
Why do I not get multiple points around each cluster? What is this graph indicating? I would like to suggest a word to a user depending on whether another user has the same word configured.
You don't see multiple points because your data are discrete, categorical observations. K-means is really only suitable for grouping continuous observations. Your data can only appear on three points on the plot you've shown and three points don't make a nice "cloud" of data.
This suggests to me that k-means is probably not appropriate for your specific problem.
Incidentally, when I run the code above, I get the plot below, which is different from the one you've shown us. Perhaps this is more like what you are expecting? The green data point belongs to (is "around") the upper-right cluster centre indicated by a black asterisk.
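To see the overplotting directly, one can count how many users fall on each position. plot(x) on a matrix uses only its first two columns (google and so here), so the sketch below tabulates those two columns and then redraws the points with a little jitter. The jitter amount is an arbitrary choice.

```r
cells  <- c(1,1,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,0,1,1,1,1,1,1)
rnames <- c("a1","a2","a3","a4","a5","a6","a7","a8","a9")
cnames <- c("google","so","test")
x <- matrix(cells, nrow = 9, ncol = 3, byrow = TRUE,
            dimnames = list(rnames, cnames))

# How many users share each (google, so) combination?
counts <- table(google = x[, "google"], so = x[, "so"])
print(counts)

# Number of distinct positions that plot(x) can show
n_distinct <- nrow(unique(x[, 1:2]))

# Jitter the coordinates slightly so coincident users become visible
set.seed(1)
plot(jitter(x[, "google"], amount = 0.05),
     jitter(x[, "so"], amount = 0.05),
     xlab = "google", ylab = "so")
```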
I have a distance matrix for ~20 elements, which I am using to do hierarchical clustering in R. Is there a way to label elements with a plot or a picture instead of just numbers, characters, etc?
So, instead of the leaf nodes having numbers, it'd have small plots or pictures.
Here is why I'm interested in this functionality. I have 2-D scatterplots like these (color indicates density)
http://www.pnas.org/content/108/51/20455/F2.large.jpg (Note that this is not my own data)
I have to analyze hundreds of such 2-D scatter plots, and am trying out various distance metrics which I'm feeding into hclust. The idea is to quickly (albeit roughly) cluster the 2-D plots to figure out the larger patterns, so we can minimize the number of time-consuming follow-up experiments. Hence, it would be ideal to label the dendrogram leaves with the corresponding 2-D plots.
There is one option:
Convert your hclust object using as.dendrogram.
Use dendrapply to apply a function to every node of the tree. The function customizes the leaves.
Here is an example where I color the leaf labels and change the shape of the nodes.
hc = hclust(dist(mtcars[1:10,]))
hcd <- as.dendrogram(hc)
mycols <- grDevices::rainbow(attr(hcd,"members"))
i <- 0
colLab <- function(n) {
  if(is.leaf(n)) {
    i <<- i + 1
    a <- attributes(n)
    attr(n, "nodePar") <-
      c(a$nodePar, list(lab.col = mycols[i],lab.bg='grey50',pch=sample(19:25,1)))
    attr(n, "frame.plot") <- TRUE
  }
  n
}
clusDendro = dendrapply(hcd, colLab)
# make plot
plot(clusDendro, main = "Customized Dendrogram", type = "triangle")
Idea:
You could customize each node label and map it to a URL, so that clicking a leaf name takes you to its image. I don't think that would be hard to do.
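A different route to the original request (small plots instead of text at the leaves) is to put the dendrogram in the top row of a layout and draw one small panel per leaf underneath it, in leaf order. This is a rough base-graphics sketch under several assumptions: the point clouds are simulated stand-ins, the centroid distance is only a placeholder metric, and the panels line up with the leaves only because the left and right margins are set to zero.

```r
set.seed(1)
n_items <- 6
# Stand-in data: one small 2-D point cloud per element (simulated)
clouds <- lapply(seq_len(n_items),
                 function(i) matrix(rnorm(100, mean = i %% 3), ncol = 2))
# Placeholder distance: Euclidean distance between cloud centroids
centroids <- t(sapply(clouds, colMeans))
hc <- hclust(dist(centroids))

# Top row: dendrogram across all columns. Bottom row: one panel per leaf.
layout(rbind(rep(1, n_items), 1 + seq_len(n_items)), heights = c(3, 1))
old <- par(mar = c(0, 0, 2, 0), xaxs = "i")
plot(as.dendrogram(hc), leaflab = "none", yaxt = "n",
     xlim = c(0.5, n_items + 0.5), main = "Leaves shown as small plots")

par(mar = c(0.3, 0.3, 0.3, 0.3))
for (k in hc$order) {                  # left-to-right leaf order
  plot(clouds[[k]], axes = FALSE, xlab = "", ylab = "", pch = 16, cex = 0.4)
  box()
}
par(old)
layout(1)
```

With hundreds of elements the panels become too small to read, so this is probably best applied to a subtree at a time.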