Displaying TraMineR (R) dendrograms in text/table format

I use the following R code to generate a dendrogram (see attached picture) with labels based on TraMineR sequences:
library(TraMineR)
library(cluster)
clusterward <- agnes(twitter.om, diss = TRUE, method = "ward")
plot(clusterward, which.plots = 2, labels=colnames(twitter_sequences))
The full code (including dataset) can be found here.
As informative as the dendrogram is graphically, it would be handy to get the same information in text and/or table format. If I call any of the aspects of the object clusterward (created by agnes), such as "order" or "merge" I get everything labeled using numbers rather than the names I get from colnames(twitter_sequences). Also, I don't see how I can output the groupings represented graphically in the dendrogram.
To summarize: How can I get the cluster output in text/table format with the labels properly displayed using R and ideally the traminer/cluster libraries?

The question concerns the cluster package. The help page for the agnes.object returned by agnes (see http://stat.ethz.ch/R-manual/R-devel/library/cluster/html/agnes.object.html) states that this object contains an order.lab component "similar to order, but containing observation labels instead of observation numbers. This component is only available if the original observations were labelled."
The dissimilarity matrix (twitter.om in your case) produced by TraMineR currently does not retain the sequence labels as row and column names. To get the order.lab component, you have to manually assign the sequence labels as both the rownames and colnames of your twitter.om matrix. I illustrate here with the mvad data provided by the TraMineR package.
library(TraMineR)
data(mvad)
## attaching row labels
rownames(mvad) <- paste("seq",rownames(mvad),sep="")
mvad.seq <- seqdef(mvad[17:86])
## computing the dissimilarity matrix
dist.om <- seqdist(mvad.seq, method = "OM", indel = 1, sm = "TRATE")
## assigning row and column labels
rownames(dist.om) <- rownames(mvad)
colnames(dist.om) <- rownames(mvad)
dist.om[1:6,1:6]
## Hierarchical cluster with agnes
library(cluster)
cward <- agnes(dist.om, diss = TRUE, method = "ward")
## here we can see that cward has an order.lab component
attributes(cward)
That gives you the order with sequence labels rather than numbers. It is not clear to me, however, which cluster outcome you want in text/table form. From the dendrogram you decide where you want to cut it, i.e., the number of groups you want, and cut the dendrogram with cutree, e.g. cl.4 <- cutree(cward, k = 4). The result cl.4 is a vector with the cluster membership of each sequence, and you get the list of the members of group 1, for example, with rownames(mvad.seq)[cl.4==1].
Alternatively, you can use the identify method (see ?identify.hclust) to select the groups interactively from the plot, but you need to pass the tree as as.hclust(cward). Here is the code for the example:
## plot the dendrogram
plot(cward, which.plots = 2, labels=FALSE)
## and select the groups manually from the plot
x <- identify(as.hclust(cward)) ## Terminate with second mouse button
## number of groups selected
length(x)
## list of members of the first group
x[[1]]
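For a non-interactive table, the cutree route described above can be sketched as follows. This is only an illustration on the built-in USArrests data rather than on sequence data; the object names (d, cw, cl, tab, groups) and the choice of k = 4 are assumptions made for the example, and the tree is converted with as.hclust() before cutting.

```r
library(cluster)

## Labelled dissimilarity matrix (state names as row and column names);
## USArrests stands in for the sequence dissimilarities here
d <- as.matrix(dist(scale(USArrests)))

## Ward clustering on the dissimilarities
cw <- agnes(d, diss = TRUE, method = "ward")

## Leaf order of the dendrogram, as labels instead of numbers
head(cw$order.lab)

## Cut the tree into k groups; membership is returned in the
## original order of the observations
cl <- cutree(as.hclust(cw), k = 4)

## One row per observation: label and cluster membership
tab <- data.frame(label = rownames(d), cluster = unname(cl))
head(tab)

## Members of each group, as a list of label vectors
groups <- split(tab$label, tab$cluster)
groups[[1]]
```

The data frame tab can then be written out with write.csv() or printed as is.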
Hope this helps.

Related

How to prevent TraMineR state distribution plot (seqdplot) from removing missing states

I am analysing some sequence data and wish to be able to see missing states within all of my sequence plots. However, I have noticed that TraMineR's state distribution plot function seqdplot automatically removes missing sequence states. I have included a reproducible example below. As you can see, the missing data is visible in the plot and legend of the sequence index plot seqIplot. However, it is automatically removed from the state distribution plot seqdplot.
How do I stop seqdplot from removing these missing values?
Create & Format Data
# Import required libraries
library(TraMineR)
library(tidyverse)
# Set seed for reproducibility
set.seed(123)
# Read in TraMineR sample data
data(mvad)
# For loop which generates missing data within the sequences
for (col in 17:86) {
mvad[sample(1:nrow(mvad),(round(nrow(mvad)*0.1))),col] <- NA
}
# Create sequence object
mvad.seq <- seqdef(mvad[, 17:86])
Sequence Index Plot (missing data visible)
# Create sequence index plot
seqIplot(mvad.seq, sortv = "from.start", with.legend = "right")
State Distribution Plot (missing data removed)
# Create state distribution plot
seqdplot(mvad.seq, sortv = "from.start", with.legend = "right")
To display missing values, simply use the argument with.missing=TRUE:
seqdplot(mvad.seq, sortv = "from.start", with.legend = "right",
with.missing=TRUE, border=NA)
By default, seqdef sets right missings as voids, i.e., it assumes sequences end at the last valid state. If you also want to treat (display) right missings as missing tokens, set right=NA in the seqdef command (it is right="DEL" by default):
mvad.seq <- seqdef(mvad[, 17:86], right=NA)

Table output of hierarchical clustering dendrogram in R

I have produced a dendrogram in R through hierarchical clustering analysis. I have 310 individuals that have been classified into 1 of 3 groups (my cut off, k, looks to be 3) based on 4 criteria. I have plotted the dendrogram, with the labels I want. But I am hoping to extract the results into a table which will be easier for me to use for further statistical work. I have manually gone through the small text on my dendrogram, but have found an error in my work, so I would like R to create the table for me to verify my work.
I have tried a few options from other websites, and from one entry on Stack Overflow, but have not been successful. Ideally, I would like the data extraction to provide output in this format:
columns[Individual ID, clustering group label (1-3)] #with all the results below for my 310 individuals
Here is what I have tried:
leaf.order <- matrix(data=NA, ncol=2, nrow=nrow(residency2), dimnames=list(c(), c("row.num", "row.name")))
leaf.order[,2] <- hc.complete2$labels[hc.complete2$order]
Which gives error:
Error in leaf.order[, 2] <- hc.complete2$labels[hc.complete2$order] : number of items to replace is not a multiple of replacement length
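One possible way to get such a table, sketched under assumptions: the example below uses the built-in iris measurements in place of residency2, a complete-linkage hclust tree named hc in place of hc.complete2, and k = 3. The names x, hc, groups, tab and tab.leaf are made up for the illustration.

```r
## Stand-in data: 150 individuals described by 4 numeric criteria
x <- scale(iris[, 1:4])
rownames(x) <- paste0("id", seq_len(nrow(x)))

## Complete-linkage hierarchical clustering
hc <- hclust(dist(x), method = "complete")

## Group membership for a cut into 3 clusters (named by individual ID)
groups <- cutree(hc, k = 3)

## Table with one row per individual: ID and cluster label (1-3)
tab <- data.frame(id = names(groups), group = unname(groups))

## Same table, rows in the left-to-right order of the dendrogram leaves
tab.leaf <- tab[hc$order, ]
head(tab.leaf)

## Optionally save it for further work
## write.csv(tab, "clusters.csv", row.names = FALSE)
```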

How can I extract the matrix derived from a heatmap created with gplots after hierarchical clustering?

I am making a heatmap, but I can't assign the result to a variable to check it before plotting; RStudio plots it automatically. I would like to get the list of rownames in the order of the heatmap. I'm not sure if this is possible. I'm using this code:
hm <- heatmap.2( assay(vsd)[ topVarGenes, ], scale="row",
trace="none", dendrogram="both",
col = colorRampPalette( rev(brewer.pal(9, "RdBu")) )(255),
ColSideColors = c(Controle="gray", Col1.7G2="darkgreen", JG="blue", Mix="orange")[
colData(vsd)$condition ] )
You can assign the plot to an object. The plot will still be drawn in the plot window, however, you'll also get a list with all the data for each plot element. Then you just need to extract the desired plot elements from the list. For example:
library(gplots)
p = heatmap.2(as.matrix(mtcars), dendrogram="both", scale="row")
p is a list with all the elements of the plot.
p # Outputs all the data in the list; lots of output to the console
str(p) # Structure of p; also lots of output to the console
names(p) # Names of all the list elements
p$rowInd # Ordering of the data rows
p$carpet # The heatmap values
You'll see all the other values associated with the dendrogram and the heatmap if you explore the list elements.
For others out there, a more complete way to capture a matrix representation of the heatmap created by gplots:
matrix_map <- p$carpet
matrix_map <- t(matrix_map)
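To get the row names in the order used by the heatmap, the returned indices can be applied to the original row names. A minimal sketch, assuming the gplots package is installed; the object names m and rows.in.order are illustrative, and pdf(NULL) is only there to keep the plot off the screen.

```r
library(gplots)

m <- as.matrix(mtcars)

pdf(NULL)                      # draw to a null device
p <- heatmap.2(m, dendrogram = "both", scale = "row", trace = "none")
dev.off()

## Row names in dendrogram order (rowInd indexes the original rows)
rows.in.order <- rownames(m)[p$rowInd]
head(rows.in.order)

## p$carpet stores the plotted values transposed, so transposing it back
## gives a matrix in the original orientation with reordered rows and columns
matrix_map <- t(p$carpet)
dim(matrix_map)
```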

differences in heatmap/clustering defaults in R (heatplot versus heatmap.2)?

I'm comparing two ways of creating heatmaps with dendrograms in R, one with made4's heatplot and one with gplots' heatmap.2. The appropriate results depend on the analysis, but I'm trying to understand why the defaults are so different, and how to get both functions to give the same (or a highly similar) result, so that I understand all the 'blackbox' parameters that go into this.
This is the example data and packages:
require(gplots)
# made4 from bioconductor
require(made4)
data(khan)
data <- as.matrix(khan$train[1:30,])
Clustering the data with heatmap.2 gives:
heatmap.2(data, trace="none")
Using heatplot gives:
heatplot(data)
very different results and scalings initially. heatplot results look more reasonable in this case so I'd like to understand what parameters to feed into heatmap.2 to get it to do the same, since heatmap.2 has other advantages/features I'd like to use and because I want to understand the missing ingredients.
heatplot uses average linkage with correlation distance so we can feed that into heatmap.2 to ensure similar clusterings are used (based on: https://stat.ethz.ch/pipermail/bioconductor/2010-August/034757.html)
dist.pear <- function(x) as.dist(1-cor(t(x)))
hclust.ave <- function(x) hclust(x, method="average")
heatmap.2(data, trace="none", distfun=dist.pear, hclustfun=hclust.ave)
resulting in:
this makes the row-side dendrograms look more similar, but the columns are still different and so are the scales. It appears that heatplot scales the columns somehow by default, which heatmap.2 doesn't do. If I add a row-scaling to heatmap.2, I get:
heatmap.2(data, trace="none", distfun=dist.pear, hclustfun=hclust.ave,scale="row")
which still isn't identical but is closer. How can I reproduce heatplot's results with heatmap.2? What are the differences?
edit2: it seems like a key difference is that heatplot rescales the data with both rows and columns, using:
if (dualScale) {
print(paste("Data (original) range: ", round(range(data),
2)[1], round(range(data), 2)[2]), sep = "")
data <- t(scale(t(data)))
print(paste("Data (scale) range: ", round(range(data),
2)[1], round(range(data), 2)[2]), sep = "")
data <- pmin(pmax(data, zlim[1]), zlim[2])
print(paste("Data scaled to range: ", round(range(data),
2)[1], round(range(data), 2)[2]), sep = "")
}
this is what I'm trying to import to my call to heatmap.2. The reason I like it is because it makes the contrasts larger between the low and high values, whereas just passing zlim to heatmap.2 gets simply ignored. How can I use this 'dual scaling' while preserving the clustering along the columns? All I want is the increased contrast you get with:
heatplot(..., dualScale=TRUE, scale="none")
compared with the low contrast you get with:
heatplot(..., dualScale=FALSE, scale="row")
any ideas on this?
The main differences between heatmap.2 and heatplot functions are the following:
1. By default, heatmap.2 uses Euclidean distance to obtain the distance matrix and the complete agglomeration method for clustering, while heatplot uses correlation distance and the average agglomeration method, respectively.
2. heatmap.2 computes the distance matrix and runs the clustering algorithm before scaling, whereas heatplot (when dualScale=TRUE) clusters already scaled data.
3. heatmap.2 reorders the dendrogram based on the row and column mean values, as described here.
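The defaults and the reordering step listed above can be mimicked with base R alone. The sketch below is an illustration on random data, not the code of either package: it builds the tree that the default dist and hclust functions give for the rows, then reorders it by row means with reorder(); all object names are made up.

```r
set.seed(1)
m <- matrix(rnorm(40), nrow = 8,
            dimnames = list(paste0("r", 1:8), paste0("c", 1:5)))

## Defaults of dist() and hclust(): Euclidean distance, complete linkage
hc <- hclust(dist(m))
dd <- as.dendrogram(hc)

## Reorder the leaves by row means; the merges are unchanged,
## only the left-to-right arrangement of branches may differ
dd.re <- reorder(dd, rowMeans(m))

rownames(m)[order.dendrogram(dd)]
rownames(m)[order.dendrogram(dd.re)]

## heatplot-style alternative: correlation distance, average linkage
hc.cor <- hclust(as.dist(1 - cor(t(m))), method = "average")
```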
The default settings (point 1) can simply be changed within heatmap.2 by supplying custom distfun and hclustfun arguments. However, points 2 and 3 cannot easily be addressed without changing the source code. This is why the heatplot function acts as a wrapper for heatmap.2: first, it applies the necessary transformation to the data, calculates the distance matrix, and clusters the data; then it uses heatmap.2 only to plot the heatmap with the above parameters.
The dualScale=TRUE argument of the heatplot function applies only row-based centering and scaling. It then reassigns the extremes of the scaled data to the zlim values:
z <- t(scale(t(data)))
zlim <- c(-3,3)
z <- pmin(pmax(z, zlim[1]), zlim[2])
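As a small worked check of the row scaling and clipping step, here is a base R sketch on made-up random data. The names m and z.clip are illustrative, and zlim is set to c(-2,2) rather than c(-3,3) only so that the clipping is visible with ten columns per row.

```r
set.seed(42)
m <- matrix(rnorm(60, mean = 5, sd = 2), nrow = 6)
m[1, 1] <- 40                       # one extreme value in row 1

## Row-wise z-scores: scale() works on columns, hence the double transpose
z <- t(scale(t(m)))
round(rowMeans(z), 10)              # each row is centred
apply(z, 1, sd)                     # ...and has unit standard deviation

## Clip the scaled values to the zlim range
zlim <- c(-2, 2)
z.clip <- pmin(pmax(z, zlim[1]), zlim[2])
range(z)                            # before clipping
range(z.clip)                       # after clipping
```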
In order to match the output from the heatplot function, I would like to propose two solutions:
I - add new functionality to the source code -> heatmap.3
The code can be found here. Feel free to browse through revisions to see the changes made to heatmap.2 function. In summary, I introduced the following options:
z-score transformation is performed prior to the clustering: scale=c("row","column")
the extreme values can be reassigned within the scaled data: zlim=c(-3,3)
option to switch off dendrogram reordering: reorder=FALSE
An example:
# require(gtools)
# require(RColorBrewer)
cols <- colorRampPalette(brewer.pal(10, "RdBu"))(256)
distCor <- function(x) as.dist(1-cor(t(x)))
hclustAvg <- function(x) hclust(x, method="average")
heatmap.3(data, trace="none", scale="row", zlim=c(-3,3), reorder=FALSE,
distfun=distCor, hclustfun=hclustAvg, col=rev(cols), symbreak=FALSE)
II - define a function that provides all the required arguments to the heatmap.2
If you prefer to use the original heatmap.2, the zClust function (below) reproduces all the steps performed by heatplot. It provides (in a list format) the scaled data matrix, row and column dendrograms. These can be used as an input to the heatmap.2 function:
# depending on the analysis, the data can be centered and scaled by row or column.
# default parameters correspond to the ones in the heatplot function.
distCor <- function(x) as.dist(1-cor(x))
zClust <- function(x, scale="row", zlim=c(-3,3), method="average") {
if (scale=="row") z <- t(scale(t(x)))
if (scale=="col") z <- scale(x)
z <- pmin(pmax(z, zlim[1]), zlim[2])
hcl_row <- hclust(distCor(t(z)), method=method)
hcl_col <- hclust(distCor(z), method=method)
return(list(data=z, Rowv=as.dendrogram(hcl_row), Colv=as.dendrogram(hcl_col)))
}
z <- zClust(data)
# require(RColorBrewer)
cols <- colorRampPalette(brewer.pal(10, "RdBu"))(256)
heatmap.2(z$data, trace='none', col=rev(cols), Rowv=z$Rowv, Colv=z$Colv)
A few additional comments regarding heatmap.2(3) functionality:
symbreak=TRUE is recommended when scaling is applied. It will adjust the colour scale, so it breaks around 0. In the current example, the negative values = blue, while the positive values = red.
col=bluered(256) may provide an alternative colouring solution, and it doesn't require RColorBrewer library.

1-D conditional slice from a 2-D probability density function in R using np package

Consider the example included in the np package for R, on page 21 of the vignettes for the np package.
npcdens returns a conditional density object and is able to plot the 2-D pdf and 2-D cdf, as shown. I wanted to know if I can somehow extract the 1-D information (pdf / cdf) from the object, e.g. as a vector, if I were to specify one of the two parameters. I am new to R and was not able to find out the format of the object.
Thanks for the help.
-Egon.
Here is the code as requested:
require(np)
data("Italy")
attach(Italy)
bw <- npcdensbw(formula=gdp~ordered(year), tol=.1, ftol=.1)
fhat <- npcdens(bws=bw)
summary(fhat)
npplot(bws=bw)
npplot(bws=bw, cdf=TRUE)
detach(Italy)
The fhat object contains all the needed info plus a whole lot more. To see what all is in there, do a str( fhat ) to see the structure.
I believe the values you are interested in are xeval, yeval, and condens (PDF density).
There are lots of ways to get at the values but I tend to like data frames. I'd pop the three vectors in a single data frame:
denDf <- cbind( year=as.character( fhat$xeval[,1] ), fhat$yeval, fhat$condens )
## had to do a dance around the year variable because it's a factor
then I'd select the values I want with a subset():
subset( denDf, year==1951 & gdp > 8 & gdp < 8.2)
since gdp is a floating point value it's very hard to select with a == operator.
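The floating-point issue can be seen on a toy data frame. This is a minimal base R sketch; df and its gdp column are stand-ins, not the np output.

```r
## Toy stand-in for the density data frame
df <- data.frame(gdp = seq(0, 1, by = 0.1))

## Exact comparison can miss a value that prints as 0.3
df$gdp[4]
df$gdp[4] == 0.3

## Selecting by a narrow interval is robust
by.range <- subset(df, gdp > 0.25 & gdp < 0.35)

## Or compare within a tolerance
by.tol <- subset(df, abs(gdp - 0.3) < 1e-8)

by.range
by.tol
```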
The method suggested by JD Long will only extract the density for data points in the existing training set. If you want the density at other points (conditioning or conditional variables), you will need to use the predict() function. The following code extracts and plots the 1-D density distribution conditioned on year == 1999, a value not contained in the original data set.
First construct a data frame with the same components as the Italy data set, with gdp regularly spaced and with "1999" an ordered factor.
yr1999<- rep("1999", 100)
gdpVals <-seq(1,35, length.out=100)
nD1999 <- data.frame(year = ordered(yr1999), gdp = gdpVals)
Next use the predict function to extract the densities.
gdpDens1999 <-predict(fhat,newdata = nD1999)
The following code plots the density.
plot(gdpVals, gdpDens1999, type='l', col='red', xlab='gdp', ylab = 'p(gdp|yr = 1999)')
