Superheat Use Given Cluster Group - r

I generated a dendrogram and obtained k clusters. I want to draw a heatmap that uses the existing dendrogram and clusters. I want to display the heatmap across multiple pages, so I draw each page in a separate call. To keep the colour scale comparable across pages, I also set heat.pal.values. Here's my code sample:
sh <- superheat(X = cbind(dt[grep("Action", colnames(dt))], profile),
heat.pal = brewer.pal(8, 'RdBu')[8:1],
heat.pal.values = c(-1.4:2.3),
bottom.label = 'variable',
grid.hline.col = '#F0EBEB',
grid.vline.col = '#F0EBEB',
smooth.heat = TRUE,
membership.rows = hc$cluster,
column.title = 'factors', row.title = 'Clusters',
bottom.label.text.size = 1.5, bottom.label.text.angle = 90)
sh$membership.rows == hc$cluster
When I then checked whether the memberships match, the result was a mix of TRUE and FALSE. I can't find what is wrong with my code. Could you please enlighten me?
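One thing that may be worth checking, separately from the membership mismatch: in R the colon operator steps by 1, so -1.4:2.3 yields only four values, while the palette above has eight colours. A minimal base-R sketch of building one position per colour follows. It assumes (not confirmed in this thread) that superheat expects heat.pal.values to have the same length as heat.pal and to lie between 0 and 1; the names lo, hi, anchors and positions are illustrative only.

```r
# The colon operator steps by 1 from the left endpoint.
vals <- -1.4:2.3
print(vals)                      # only four values

n_col <- 8                       # colours in brewer.pal(8, "RdBu")
lo <- -1.4                       # fixed range shared by all pages,
hi <- 2.3                        # taken from the question

# One anchor per colour on the data scale ...
anchors <- seq(lo, hi, length.out = n_col)
# ... rescaled to [0, 1] (assumed to be what heat.pal.values wants).
positions <- (anchors - lo) / (hi - lo)
print(positions)
```

Using the same lo and hi in every call is what would keep the colour scale comparable between pages.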

Related

Change name of groups in bal.plot

I am trying to visualize results from MatchIt procedure with bal.plot() from cobalt package.
It works just fine, except I would like to change the labels for the groups, which by default are "Unadjusted sample" and "Adjusted sample".
bal.plot(AHEAD_nomiss, var.name = "KCH_TKS", which = "both",
type = "histogram", mirror = F,
weights = AHEAD_nomiss$att.weights, treat = AHEAD_nomiss$group)
Author of cobalt package here. Thank you for using my package!
Edit. Original post at the bottom.
I just added some functionality to bal.plot for this in the development version of cobalt, which can be installed with devtools::install_github("ngreifer/cobalt"). Use the sample.names argument to supply a vector of names to give bal.plot and they'll appear in the facet labels. The vector should be as long as the number of samples (in your case, 2). Your new code should look like this:
bal.plot(AHEAD_nomiss, var.name = "KCH_TKS", which = "both",
type = "histogram", mirror = F,
weights = AHEAD_nomiss$att.weights, treat = AHEAD_nomiss$group,
sample.names = c("UNWEIGHTED", "WEIGHTED"))
Of course you can change the names. If you don't want to install the development version of cobalt (it's not guaranteed to be stable), you can use my solutions below.
I didn't intend bal.plot to be used for publication, so I didn't make it super flexible, unlike love.plot. One thing you can do is manually program the histograms using ggplot2. Of course, this requires learning how to use ggplot2, which can be a challenge, and looking at the source code of bal.plot probably won't help because of all the checks and transformations that occur. Here's some code that might work for you:
unweighted <- data.frame(KCH_TKS = AHEAD_nomiss$KCH_TKS,
treat = factor(AHEAD_nomiss$group),
weights = 1,
adj = "UNWEIGHTED",
stringsAsFactors = FALSE)
weighted <- data.frame(KCH_TKS = AHEAD_nomiss$KCH_TKS,
treat = factor(AHEAD_nomiss$group),
weights = AHEAD_nomiss$att.weights,
adj = "WEIGHTED",
stringsAsFactors = FALSE)
data <- rbind(unweighted, weighted)
ggplot(data, aes(x = KCH_TKS, fill = treat)) +
geom_histogram(aes(weight = weights), bins = 10, alpha = .4, color = "black") +
facet_grid(~adj)
One way you can hack bal.plot is to provide a set of weights that are all equal to 1 as well as your desired weights and leave which at its default. If you give the weights names, those names will appear on the facet labels. So, for your example, try
bal.plot(AHEAD_nomiss, var.name = "KCH_TKS",
type = "histogram", mirror = F,
weights = list(UNWEIGHTED = rep(1, nrow(AHEAD_nomiss)),
WEIGHTED = AHEAD_nomiss$att.weights),
treat = AHEAD_nomiss$group)
You should see that "UNWEIGHTED" and "WEIGHTED" are the new facet label names. You can of course change them to be whatever you want.

How to reorder cluster leaves (columns) when plotting pheatmap in R?

I am plotting a set of 15 samples clustered in three groups, A, B and C, and the heatmap orders them as C, A, B. (I have read that this is because it places the cluster with the strongest similarity on the right.) I would like to reorder the clusters so that the leaves appear as A, B, C, that is, to reorganise the order of the cluster branches. Is there a function that can help me do this?
The code I have used:
library(pheatmap)
pheatmap(mat, annotation_col = anno,
color = colorRampPalette(c("blue", "white", "red"))(50), show_rownames = F)
(cluster_cols=FALSE would not cluster the samples at all, but that is not what I want)
I have also found on another forum this, but I am unsure how to change the function code and if it would work for me:
clustering_callback callback function to modify the clustering. Is
called with two parameters: original hclust object and the matrix used
for clustering. Must return a hclust object.
Hi, I am not sure if this helps, but if you check ?pheatmap and scroll down to the examples, the last snippet of code covers exactly this case.
# Modify ordering of the clusters using clustering callback option
callback = function(hc, mat){
sv = svd(t(mat))$v[,1]
dend = reorder(as.dendrogram(hc), wts = sv)
as.hclust(dend)
}
pheatmap(test, clustering_callback = callback)
I tried it on my heatmap and the previously defined function actually sorted the clusters exactly the way I needed them. Although I have to admit (as I am new to R) I don't fully understand what the defined callback function does.
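For anyone else puzzled by the callback: as far as I can tell, it computes one weight per clustered item (the first singular vector from svd()) and passes those weights to reorder(), which rotates the branches of the dendrogram according to the weights without changing which items are merged. A small base-R sketch with made-up data and hypothetical weights (here simply the desired position of each sample) illustrates the rotation:

```r
set.seed(1)
# Toy data: 5 features (rows) x 8 samples (columns); names are made up.
mat <- matrix(rnorm(40), nrow = 5, dimnames = list(NULL, paste0("s", 1:8)))
hc <- hclust(dist(t(mat)))                 # cluster the samples (columns)

# Hypothetical weights: one per sample, in the original column order.
wts <- seq_len(ncol(mat))

# reorder() rotates branches using the aggregated weight of each subtree
# (agglo.FUN = mean here); the merges themselves are left untouched.
hc_sorted <- as.hclust(reorder(as.dendrogram(hc), wts = wts, agglo.FUN = mean))

print(hc$labels[hc$order])                 # leaf order before
print(hc_sorted$labels[hc_sorted$order])   # leaf order after
```

In pheatmap the same idea would go inside clustering_callback. Note that the leaf order can only be changed by rotating branches, so an arbitrary order such as A, B, C is reachable only if the tree allows it.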
Maybe you can also write a function with the dendsort package, as I know you can reorder the branches of a dendrogram with it.
In this case, luckily, the clustering of the columns coincides with the sample number order (which is similar to the dendrogram), so I added cluster_cols = FALSE, which avoided re-clustering the columns (and avoided writing the callback function).
pheatmap(mat,
annotation_col = anno,
fontsize_row = 2,
show_rownames = T,
cutree_rows = 3,
cluster_cols = FALSE)
# install.packages("dendsort")
library(dendsort)
sort_hclust <- function(...) as.hclust(dendsort(as.dendrogram(...)))
# pass the sorted tree to pheatmap; columns are clustered on dist(t(mat))
pheatmap(mat, cluster_cols = sort_hclust(hclust(dist(t(mat)))))

Generating Clusplot for K Prototype in R

I have a customer dataset with a mix of continuous and categorical variables and would like to cluster the customers into groups. I am trying k-prototypes for the first time, but how would I get a nice visual representation similar to clusplot for k-means?
install.packages("clustMixType")
library(clustMixType)
data = read.csv("customerdata.csv", header = TRUE)
kproto = kproto(data, k=5, lambda = NULL, iter.max = 100, nstart = 1,
keep.data = TRUE)
clprofiles(kproto, data, vars = NULL, col = NULL)
Don't rely on a black box function.
Study what clusplot does, and adapt it to suit your needs exactly. Get the source code, and check what it does.
The answer is already in your code, where kproto is the fitted object and data is the data frame:
clprofiles(kproto, data, vars = NULL, col = NULL)
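If you do adapt it yourself: clusplot() draws the observations in two dimensions and marks the clusters. A rough analogue for mixed data, which is a swapped-in technique and not what clusplot() or clustMixType does internally, is to compute Gower dissimilarities with cluster::daisy(), project them to 2-D with cmdscale(), and colour the points by cluster. The toy data frame and the cl vector below are stand-ins; with a real fit you would presumably use the cluster assignments stored in the kproto result instead.

```r
library(cluster)   # daisy(); a recommended package shipped with R

set.seed(42)
# Toy customer data with mixed types (made up for illustration).
customers <- data.frame(
  age     = round(runif(30, 18, 70)),
  spend   = rlnorm(30, meanlog = 4),
  segment = factor(sample(c("web", "store", "phone"), 30, replace = TRUE))
)

# Gower dissimilarity handles numeric and categorical columns together.
d <- daisy(customers, metric = "gower")

# Classical multidimensional scaling down to two plotting coordinates.
coords <- cmdscale(d, k = 2)

# Stand-in cluster labels; replace with the assignments from your own model.
cl <- cutree(hclust(d), k = 3)

if (interactive()) {
  plot(coords, col = cl, pch = 19, xlab = "Dim 1", ylab = "Dim 2",
       main = "Clusters on a 2-D projection of Gower distances")
}
```

The projection is only a visual aid: points that overlap in two dimensions may still be well separated in the original variables.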

R Circular Chord Plots

I'm learning how to create circular plots in R, similar to CIRCOS.
I'm using the circlize package to draw links between origin and destination pairs based on whether the flight was outbound (OB), inbound (IB) or a return. The logic of the data doesn't really matter; it's just a toy example.
I have gotten the plot to work with the code below, which follows this logic:
Take my data and combine the destination column with the flight type.
Convert to a matrix and feed the origin and the new column into circlize.
library(dplyr)
library(circlize)
# Create Fake Flight Information in a table
orig = c("IE","GB","US","ES","FI","US","IE","IE","GB")
dest = c("FI","FI","ES","ES","US","US","FI","US","IE")
direc = c("IB","OB","RETURN","DOM","OB","DOM","IB","RETURN","IB")
mydf = data.frame(orig, dest, direc)
# Add a column that combines the dest and direction together
mydf <- mydf %>%
mutate(key = paste(dest,direc)) %>%
select (orig, key)
# Create a Binary Matrix Based on mydf
mymat <- data.matrix(as.data.frame.matrix(table(mydf)))
# create the objects you want to link from to in your diagram
from <- rownames(mymat)
to <- colnames(mymat)
# Create Diagram by suppling the matrix
par(mar = c(1, 1, 1, 1))
chordDiagram(mymat, order = sort(union(from, to)), directional = TRUE)
circos.clear()
I like the plot a lot but would like to change it a little. For example, FI (Finland) appears three times on the diagram: FI IB, FI OB and FI. I would like to combine them all under FI if possible and distinguish between the types of flights using a colour scheme, arrows, or even an additional track that acts as an umbrella for IB, OB and RETURN flights.
So for Example,
FI OB would be placed in FI but have a one way arrow to GB to signify OB
FI IB would be placed in FI but have a one way arrow into FI
FI RETURN (if it exists) would have a double headed arrow
Can anyone help? Has anyone seen anything similar done before?
The end result should show each country on the plot only once, so that someone can see very quickly which countries have the most flights.
I have tried following other posts, but I'm afraid I get lost when they move on to the more advanced material.
Thank you very much for your time
First, I think there is a duplicated record (IE-FI-IB) in your data.
I will first attach the code and figure and then explain a little bit.
df = data.frame(orig, dest, direc, stringsAsFactors = FALSE)
df = unique(df)
col = c("IB" = "red",
"OB" = "blue",
"RETURN" = "orange",
"DOM" = "green")
directional = c("IB" = -1,
"OB" = 1,
"RETURN" = 2,
"DOM" = 0)
diffHeight = c("IB" = -0.04,
"OB" = 0.04,
"RETURN" = 0,
"DOM" = 0)
chordDiagram(df[1:2], col = col[df[[3]]], directional = directional[df[[3]]],
direction.type = c("arrows+diffHeight"),
diffHeight = diffHeight[df[[3]]])
legend("bottomleft", pch = 15, legend = names(col), col = col)
First, you need the development version of circlize, which you can install with
devtools::install_github("jokergoo/circlize")
In this new version, chordDiagram() accepts a data frame as input and can draw two-headed arrows for the links (added just after reading your post :)).
In the code above, col, directional, direction.type and diffHeight can all be set as vectors whose elements correspond to the rows of df.
When the directional argument of chordDiagram() is set to 2, the corresponding link has two directions. If direction.type then contains arrows, the link gets a two-headed arrow.
Since direction.type is a vector that corresponds to the rows of df, if you want to show the direction of a single link both by an arrow and by an offset of the roots, you need to merge the two options into a single string, as in the example code: "arrows+diffHeight".
By default, links run from the first column to the second column. In your case IB means the reversed direction, so for those rows directional is set to -1 and diffHeight to a negative value to reverse the default direction.
Finally, I notice that you have links that start and end in the same sector (ES-ES-DOM and US-US-DOM). You can use the self.link argument to control how such self-links are drawn. self.link is set to 1 in the following figure.
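To see why the settings line up with the rows of df, here is a small base-R sketch of the lookup used in the call above (it does not need circlize). Indexing a named vector by the flight-type column returns one value per row; link_col and link_dir are illustrative names only.

```r
orig  <- c("IE","GB","US","ES","FI","US","IE","IE","GB")
dest  <- c("FI","FI","ES","ES","US","US","FI","US","IE")
direc <- c("IB","OB","RETURN","DOM","OB","DOM","IB","RETURN","IB")

# Drop the duplicated IE-FI-IB record, as in the answer.
df <- unique(data.frame(orig, dest, direc, stringsAsFactors = FALSE))

col         <- c(IB = "red", OB = "blue", RETURN = "orange", DOM = "green")
directional <- c(IB = -1, OB = 1, RETURN = 2, DOM = 0)

# One colour and one direction code per remaining row of df.
link_col <- col[df$direc]
link_dir <- directional[df$direc]

print(cbind(df, link_col = unname(link_col), link_dir = unname(link_dir)))
```

Each row of the printed table is one link, together with the colour and direction code that chordDiagram() would receive for it.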
Do you need the arrows at all? The colour coding in the graph already tells the from/to story: each link takes the colour of its FROM country and arrives at the TO country in that colour, and if FROM == TO the link returns to its own base in its own colour (see US or ES for example).
library(dplyr)
library(circlize)
# Create Fake Flight Information in a table
orig = c("IE","GB","US","ES","FI","US","IE","IE","GB")
dest = c("FI","FI","ES","ES","US","US","FI","US","IE")
mydf = data.frame(orig, dest)
# Create a Binary Matrix Based on mydf
mymat <- data.matrix(as.data.frame.matrix(table(mydf)))
# create the objects you want to link from to in your diagram
from <- rownames(mymat)
to <- colnames(mymat)
# Create Diagram by suppling the matrix
par(mar = c(1, 1, 1, 1))
chordDiagram(mymat, order = sort(union(from, to)), directional = TRUE)
circos.clear()
By the way, there is also an offset difference on the edge that tells whether it is the FROM end (wider edge) or the TO end (smaller edge).

Error plotting Kohonen maps in R?

I was reading through this blog post on R-bloggers and I'm confused by the last section of the code and can't figure it out.
http://www.r-bloggers.com/self-organising-maps-for-customer-segmentation-using-r/
I've attempted to recreate this with my own data. I have 5 variables that follow an exponential distribution with 2755 points.
I am fine with and can plot the map that it generates:
plot(som_model, type="codes")
The section of the code I don't understand is the:
var <- 1
var_unscaled <- aggregate(as.numeric(training[,var]),by=list(som_model$unit.classif),FUN = mean, simplify=TRUE)[,2]
plot(som_model, type = "property", property=var_unscaled, main = names(training)[var], palette.name=coolBlueHotRed)
As I understand it, this section of the code is supposed to plot one of the variables over the map to see what it looks like, but this is where I run into problems. When I run this section of the code I get the warning:
Warning message:
In bgcolors[!is.na(showcolors)] <- bgcol[showcolors[!is.na(showcolors)]] :
number of items to replace is not a multiple of replacement length
and it produces the plot:
Which somehow just doesn't look right...
Now what I think it has come down to is the way the aggregate function has re-ordered the data. The length of var_unscaled is 789 and the length of som_model$data, training[,var] and unit.classif are all of length 2755. I tried plotting the aggregated data, the result was no warning but an unintelligible graph (as expected).
Now I think it has done this because unit.classif has a lot of repeated numbers inside it and that's why it has reduced in size.
The question is: should I worry about the warning? Is it producing an accurate graph? What exactly is the property argument of the plot command looking for? Is there a different way I could aggregate the data?
I think you have to define the colour palette function first. If you define
coolBlueHotRed <- function(n, alpha = 1) {rainbow(n, end=4/6, alpha=alpha)[n:1]}
and then try to get a plot, for example
plot(som_model, type = "count", palette.name = coolBlueHotRed)
the plot is drawn successfully.
This link can help you: http://rgm3.lab.nig.ac.jp/RGM/R_rdfile?f=kohonen/man/plot.kohonen.Rd&d=R_CC
I think that not all of the cells on your map have points inside.
You have a 30 by 30 map (900 cells) and about 2700 points, so on average about 3 points per cell. With high probability some cells have more than 3 points and some cells are empty, which is consistent with var_unscaled having only 789 values.
The code in the post on R-bloggers works well when all of the cells have points inside.
To make it work on your data, try changing this part:
var <- 1
var_unscaled <- aggregate(as.numeric(training[, var]), by = list(som_model$unit.classif), FUN = mean, simplify = TRUE)[, 2]
plot(som_model, type = "property", property = var_unscaled, main = names(training)[var], palette.name = coolBlueHotRed)
with this one:
var <- 1
var_unscaled <- aggregate(as.numeric(data.temp[, data.classes][, var]),
by = list(som_model$unit.classif),
FUN = mean,
simplify = TRUE)
v_u <- rep(0, max(var_unscaled$Group.1))
v_u[var_unscaled$Group.1] <- var_unscaled$x
plot(som_model,
type = "property",
property = v_u,
main = colnames(data.temp[, data.classes])[var],
palette.name = coolBlueHotRed)
Hope it helps.
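The same idea as a small self-contained sketch in base R, in case it is easier to adapt. It differs from the snippet above in two ways that are my own assumptions: the vector is sized by the total number of map units, passed in as n_units (for a kohonen model this is presumably nrow(som_model$grid$pts)), and empty units get NA instead of 0 so they are not mistaken for a real mean of zero. Whether plot() handles the NA values gracefully is worth checking on your version of kohonen.

```r
# Mean of `values` per map unit, with NA for units that received no points.
# `n_units` is the total number of cells on the map (hypothetical argument).
unit_means <- function(values, unit_classif, n_units) {
  m <- tapply(values, unit_classif, mean)   # one mean per non-empty unit
  out <- rep(NA_real_, n_units)
  out[as.integer(names(m))] <- m
  out
}

# Toy example: 6 units, but only units 1, 3 and 5 contain observations.
values       <- c(10, 20, 30, 40)
unit_classif <- c(1, 1, 3, 5)
res <- unit_means(values, unit_classif, n_units = 6)
print(res)
```

Sizing the vector by the highest occupied unit would give length 5 in this toy case, one short of the 6 units on the map, which is the kind of length mismatch behind the warning in the question.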
Just add these definitions to your script:
coolBlueHotRed <- function(n, alpha = 1) {rainbow(n, end=4/6, alpha=alpha)[n:1]}
pretty_palette <- c("#1f77b4","#ff7f0e","#2ca02c", "#d62728","#9467bd","#8c564b","#e377c2")

Resources