Dendrogram and HistDAWass package - r

I am using the HistDAWass package (https://cran.r-project.org/web/packages/HistDAWass/index.html) to perform clustering using a script partially provided by the package author.
Because Data1.csv does not include a column with the sample names (labels), the dendrogram labels its leaves I1...I6.
I therefore tried a new file (Data2.csv) whose first column contains the labels, but I get an error.
I would appreciate it if someone could explain how to generate the dendrogram with the new labels.
Script:
library(HistDAWass)
data=read.csv('D:/Data1.csv', header = FALSE)
data=t(data)
Hdata=MatH(nrows=6,ncols = 1)
for (i in 1:get.MatH.nrows(Hdata)){
tmp=data2hist(as.vector(data[,i]))
Hdata@M[i,1][[1]]=tmp
}
results=WH_hclust(x = Hdata,simplify = TRUE, method="complete")
plot(results) # it plots the dendrogram
Data files (in zip):
http://ge.tt/8yVsiQS2/v/0

The script shows one way to build a matrix in which each cell holds a distributionH object. Inside the for loop, a distributionH is built from the raw data in each row of the csv file and stored in a new MatH (a matrix of distributions).
To build the same object from the Data2.csv file, run the following script:
library(HistDAWass)
#read data
data=read.csv('Data2.csv', header = FALSE)
#initialize an empty MatH matrix using names from the first column of data
Hdata=MatH(nrows=nrow(data),rownames=as.list(as.character(data[,1])),ncols = 1)
#Fill the matrix
for (i in 1:get.MatH.nrows(Hdata)){
tmp=data2hist(as.vector(t(data[i,2:ncol(data)])))
Hdata@M[i,1][[1]]=tmp
}
#Do hierarchical clustering
results=WH_hclust(x = Hdata,simplify = TRUE, method="complete")
plot(results) # it plots the dendrogram
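The data-shaping step in the script above (labels from the first column, one numeric vector per row from the remaining columns) can be checked on its own before involving HistDAWass. The following is a minimal base-R sketch; the inline CSV text and the names `csv_text`, `row_labels`, and `row_values` are hypothetical stand-ins, not part of the original Data2.csv.

```r
# Hypothetical in-memory stand-in for Data2.csv: label first, raw values after
csv_text <- "A,1.2,3.4,2.2,5.1
B,0.5,0.7,1.9,2.4
C,4.0,4.4,3.8,5.0"
data <- read.csv(text = csv_text, header = FALSE, stringsAsFactors = FALSE)

# The first column holds the labels that should end up on the dendrogram
row_labels <- as.character(data[, 1])

# Each remaining row becomes a plain numeric vector, the same expression
# the script passes to data2hist()
row_values <- lapply(seq_len(nrow(data)), function(i) {
  as.vector(t(data[i, 2:ncol(data)]))
})
names(row_values) <- row_labels

str(row_values)
```

If `str(row_values)` shows character vectors instead of numeric ones, the file probably contains a header row or non-numeric cells, which is worth ruling out before building the MatH object.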

Related

get results of principal Component Analysis in R

I want to get the results for PC1 and PC2 so that I can plot the curves of both on the same graph with Tableau Desktop.
How can I do this?
data = read.csv(file="data.csv",header=TRUE, sep=";")
data.active <- data[, 1:30]
library(factoextra)
res.pca <- prcomp(data.active,center = TRUE, scale. = TRUE)
fviz_eig(res.pca)
I think you need to write a csv with the results to pass them from R to Tableau. The code for that is written below:
# Principal Components Analysis
res.pca <- stats::prcomp(iris[,-5],center = TRUE, scale. = TRUE)
# Choose number of dimension kept
factoextra::fviz_eig(res.pca)
# Some visualisation
factoextra::fviz_pca_var(res.pca)
factoextra::fviz_pca_ind(res.pca)
factoextra::fviz_pca_biplot(res.pca)
# access transformed points
str(res.pca)
res.pca$x
# save points in csv to use outside of R
utils::write.csv(x = res.pca$x, file = "path/data_pca.csv")
# Load your data and do graphs the usual way with tableau
I used ?prcomp to find where the transformed data sits in the result. You can also push your analysis further in R with some nice graphics (biplots of individuals/variables, clustering, ...) and import only the images into Tableau, using: link
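Since the question only asks for PC1 and PC2, it may be convenient to export just those two columns together with an identifier, so the rows can be matched up again in Tableau. This is a sketch under assumptions: the column names `id` and `Species`, the file name, and the use of `tempdir()` are illustrative choices, not requirements of prcomp or Tableau.

```r
# PCA on the numeric iris columns, as in the answer above
res.pca <- stats::prcomp(iris[, -5], center = TRUE, scale. = TRUE)

# Keep only the first two components plus identifying columns
scores <- data.frame(
  id      = seq_len(nrow(res.pca$x)),   # hypothetical row identifier
  Species = iris$Species,
  PC1     = res.pca$x[, "PC1"],
  PC2     = res.pca$x[, "PC2"]
)

# Write to a temporary location; replace with a real path for Tableau
out_file <- file.path(tempdir(), "data_pca_pc1_pc2.csv")
utils::write.csv(scores, file = out_file, row.names = FALSE)

# Read it back to confirm the layout Tableau will see
head(utils::read.csv(out_file))
```

Setting `row.names = FALSE` avoids an unnamed first column in the csv, which otherwise tends to show up as an extra field on import.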

How to make R output text details about a dendrogram object?

Please see my previous question for details relating to test data and commands used to create a dendrogram: Using R to cluster based on euclidean distance and a complete linkage metric, too many vectors?
Here is a quick summary of my commands to make the dendrogram:
un_exprs <- as.matrix(read.table("sample.txt", header=TRUE, sep = "\t", row.names = 1, as.is=TRUE))
exprs <- t(un_exprs)
eucl_dist=dist(exprs,method = 'euclidean')
hie_clust=hclust(eucl_dist, method = 'complete')
dend <- as.dendrogram(hie_clust)
plot(dend)
This makes a very nice dendrogram plot. However, let's say this dendrogram has 2 clusters... I want to get a text list of the elements belonging to each of the 2 clusters. I'm assuming this is trivial, but I don't have enough experience with R for this to be intuitive. Thanks!
You can compute this from the hclust result with stats::cutree:
cutree(hie_clust,k=2)
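cutree returns a named integer vector (one cluster number per item), so a text list per cluster can be obtained by splitting the names by that vector. A minimal sketch on made-up data follows; `toy` and its row names are hypothetical and stand in for the expression matrix from the question.

```r
# Hypothetical toy data; rows are the items being clustered
set.seed(1)
toy <- rbind(matrix(rnorm(15, mean = 0),  nrow = 5),
             matrix(rnorm(15, mean = 10), nrow = 5))
rownames(toy) <- paste0("sample", 1:10)

# Same pipeline as in the question: euclidean distance, complete linkage
hc <- hclust(dist(toy, method = "euclidean"), method = "complete")

# Named vector: names are the row labels, values are cluster numbers
membership <- cutree(hc, k = 2)

# One character vector of member names per cluster
members_by_cluster <- split(names(membership), membership)
print(members_by_cluster)
```

The same `split(names(x), x)` idiom should work on `cutree(hie_clust, k = 2)` as long as the distance object carried row names.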

Extracting values from a graph

I have a graph that is created from complex numbers by the function below. I would like to extract the data points that correspond to the line in the plot, so that I can work with them as a vector of data.
library(multitaper)
NW<-10
K<-5
x<-c(2,3,1,3,4,6,7,8,5,4,3,2,4,5,7,8,6,4,3,2,4,5,7,8,6,4,5,3,2,5,7,8,6,4,5,3,6,7,8,8,9,7,6,5,4,7)
resSpec <- spec.mtm(as.ts(x), k= K, nw=NW, nFFT = length(x),
centreWithSlepians = TRUE, Ftest = TRUE,
jackknife = FALSE, maxAdaptiveIterations = 100,
plot =FALSE, na.action = na.fail)
plot(resSpec)
What would be the best procedure? I tried saving the plot as EMF. I wanted to use the ReadImages package, which I believe was the right one, but it was not available for R version 3.02, so I could not use it. What is the correct procedure for saving the plot and extracting the data? Are there other packages I could use, and in which file types can I save the graph? As far as I can see, R on Windows only permits EMF.
Any help welcomed
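One thing that may make the image route unnecessary: spectrum estimators in R usually return the plotted values inside the result object rather than only drawing them. The sketch below uses base R's `stats::spec.pgram` instead of `multitaper::spec.mtm`, so it is a swapped-in technique, not the multitaper estimate. It is an assumption to be verified with `str(resSpec)` that the `spec.mtm` result likewise exposes `freq` and `spec` components.

```r
# Series from the question
x <- c(2,3,1,3,4,6,7,8,5,4,3,2,4,5,7,8,6,4,3,2,4,5,7,8,6,4,5,3,2,5,7,8,6,4,5,3,6,7,8,8,9,7,6,5,4,7)

# Base-R periodogram, computed without plotting
sp <- stats::spec.pgram(as.ts(x), taper = 0.1, plot = FALSE)

# The values behind the plotted line: frequency on x, spectral estimate on y
spec_df <- data.frame(freq = sp$freq, spec = as.vector(sp$spec))
head(spec_df)
```

If the multitaper object has the same layout, `data.frame(freq = resSpec$freq, spec = resSpec$spec)` would give the corresponding vectors directly.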

R programming - Graphic edges too large error while using clustering.plot in EMA package

I'm an R programming beginner and I'm trying to use the clustering.plot method available in the R package EMA. My clustering works fine and I can see the results populated as well. However, when I try to generate a heat map using clustering.plot, it gives me the error "Error in plot.new (): graphic edges too large". My code is below:
#Loading library
library(EMA)
library(colonCA)
#Some information about the data
data(colonCA)
summary(colonCA)
class(colonCA) #Expression set
#Extract expression matrix from colonCA
expr_mat <- exprs(colonCA)
#Applying average linkage clustering on colonCA data using Pearson correlation
expr_genes <- genes.selection(expr_mat, thres.num=100)
expr_sample <- clustering(expr_mat[expr_genes,],metric = "pearson",method = "average")
expr_gene <- clustering(data = t(expr_mat[expr_genes,]),metric = "pearson",method = "average")
expr_clust <- clustering.plot(tree = expr_sample,tree.sup=expr_gene,data=expr_mat[expr_genes,],title = "Heat map of clustering",trim.heatmap =1)
I do not get any error when it comes to actually executing the clustering process. Could someone help?
In your example, some of the rownames of expr_mat are very long (max(nchar(rownames(expr_mat))) = 271 characters). The clustering.plot function tries to make a margin large enough for all the names, but because the names are so long, there isn't room for anything else.
The really long names seem to have long stretches of periods in them. One way to condense the names of these genes is to replace runs of 2 or more periods with just one, so I would add in this line
#Extract expression matrix from colonCA
expr_mat <- exprs(colonCA)
rownames(expr_mat)<-gsub("\\.{2,}","\\.", rownames(expr_mat))
Then you can run all the other commands and plot like normal.
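To see what the substitution does before applying it to the real row names, it can be tried on a few made-up strings. The names in `long_names` are hypothetical. The replacement is written as a plain "." here, which is treated literally in the replacement string.

```r
# Hypothetical gene names with long runs of periods
long_names <- c("Hsa.3004",
                "H.sapiens....mRNA......for...protein",
                "gene..A.b")

# Collapse every run of 2 or more periods into a single period
short_names <- gsub("\\.{2,}", ".", long_names)
print(short_names)

# Compare the longest name before and after
max(nchar(long_names))
max(nchar(short_names))
```

Single periods are left untouched, so names that were already short do not change.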

Displaying TraMineR (R) dendrograms in text/table format

I use the following R code to generate a dendrogram (see attached picture) with labels based on TraMineR sequences:
library(TraMineR)
library(cluster)
clusterward <- agnes(twitter.om, diss = TRUE, method = "ward")
plot(clusterward, which.plots = 2, labels=colnames(twitter_sequences))
The full code (including dataset) can be found here.
As informative as the dendrogram is graphically, it would be handy to get the same information in text and/or table format. If I call any of the aspects of the object clusterward (created by agnes), such as "order" or "merge" I get everything labeled using numbers rather than the names I get from colnames(twitter_sequences). Also, I don't see how I can output the groupings represented graphically in the dendrogram.
To summarize: How can I get the cluster output in text/table format with the labels properly displayed using R and ideally the TraMineR/cluster libraries?
The question concerns the cluster package. The help page for the agnes.object returned by agnes
(See http://stat.ethz.ch/R-manual/R-devel/library/cluster/html/agnes.object.html ) states that this object contains an order.lab component "similar to order, but containing observation labels instead of observation numbers. This component is only available if the original observations were labelled."
The dissimilarity matrix (twitter.om in your case) produced by TraMineR does currently not retain the sequence labels as row and column names. To get the order.lab component you have to manually assign sequence labels as both the rownames and colnames of your twitter.om matrix. I illustrate here with the mvad data provided by the TraMineR package.
library(TraMineR)
data(mvad)
## attaching row labels
rownames(mvad) <- paste("seq",rownames(mvad),sep="")
mvad.seq <- seqdef(mvad[17:86])
## computing the dissimilarity matrix
dist.om <- seqdist(mvad.seq, method = "OM", indel = 1, sm = "TRATE")
## assigning row and column labels
rownames(dist.om) <- rownames(mvad)
colnames(dist.om) <- rownames(mvad)
dist.om[1:6,1:6]
## Hierarchical cluster with agnes
library(cluster)
cward <- agnes(dist.om, diss = TRUE, method = "ward")
## here we can see that cward has an order.lab component
attributes(cward)
That covers getting the order with sequence labels rather than numbers. It is not clear to me, though, which cluster outcome you want in text/table form. From the dendrogram you decide where you want to cut it, i.e., the number of groups you want, and cut the dendrogram with cutree, e.g. cl.4 <- cutree(cward, k = 4). The result cl.4 is a vector with the cluster membership of each sequence, and you get the list of the members of group 1, for example, with rownames(mvad.seq)[cl.4==1].
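As a rough illustration of the text/table output, here is a sketch that builds a label/cluster table from a labelled dissimilarity matrix. It uses `stats::hclust` in place of `agnes` so that it runs with base R only. The matrix `d`, the labels, and the choice of `"ward.D2"` are assumptions made for the example, not the TraMineR data or the exact Ward variant used by agnes.

```r
# Hypothetical labelled dissimilarity matrix standing in for dist.om
labels <- paste0("seq", 1:6)
pts <- c(1, 2, 3, 20, 21, 22)
d <- as.matrix(dist(pts))
dimnames(d) <- list(labels, labels)

# Hierarchical clustering on the labelled dissimilarities
hc <- hclust(as.dist(d), method = "ward.D2")

# Cut into 2 groups; the result is named by the labels
cl <- cutree(hc, k = 2)

# Table form: one row per sequence with its cluster number
cluster_table <- data.frame(label = names(cl), cluster = unname(cl))
print(cluster_table)

# Leaf labels in dendrogram order, analogous to the order.lab component
print(hc$labels[hc$order])
```

The table can then be written out with `write.csv` or split by cluster, depending on which text format is wanted.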
Alternatively, you can use the identify method (see ?identify.hclust) to select the groups interactively from the plot, but you need to pass the argument as as.hclust(cward). Here is the code for the example:
## plot the dendrogram
plot(cward, which.plots = 2, labels=FALSE)
## and select the groups manually from the plot
x <- identify(as.hclust(cward)) ## Terminate with second mouse button
## number of groups selected
length(x)
## list of members of the first group
x[[1]]
Hope this helps.
