Automatically categorize and add annotations using pheatmap in R - r

I have a dataframe made by the school grades of some students in different subjects. The students are also characterized by their gender (F or M), that is included as a suffix in their names (e.g. Anne_F, Albert_M, etc...)
With these data I have created an heatmap with the package pheatmap(), in this way:
library(pheatmap)
Anne_F <- c(9,7,6,10,6)
Carl_M <- c(6,7,9,5,7)
Albert_M <- c(8,8,8,7,9)
Kate_F <- c(10,5,10,9,5)
Emma_F <- c(6,8,10,8,7)
matrix <- cbind(Anne_F, Carl_M, Albert_M, Kate_F, Emma_F)
rownames(matrix) <- c("Math", "Literature", "Arts", "Science", "Music")
print(matrix)
heatmap <- pheatmap(
mat = matrix,
cluster_rows = F,
cluster_cols = F,
cellwidth = 30,
cellheight = 30,
)
heatmap
Which gives this matrix
and the relative plot:
Now I would like to automatically recognize if a student is Male or Female and add this as a column annotation in the heatmap, in order to have a graph like this:
I have thought to create two vectors, one with the name of the students:
name <- c("Anne", "Carl", "Albert", "Kate", "Emma") and one with the respective genders: gender <- c("F", "M", "M", "F", "F") , but I can't figure out how to associate names with genders, and to show them on the heatmap.
I don't mean to manually associate one-name to one-gender (as Anne to F, Albert to M, etc,). I need to take the entire vector of names and associate it with the corresponding vector of genders (and then annotate them on the heatmap), because their number will increase in the future.
Many thanks in advance for your help.

You need to use annotation_col option in pheatmap.
library(pheatmap)
# split matrix into "Name" and "Gender"
name_gender_matrix <- str_split_fixed(colnames(matrix), "_", 2)
# Data that maps to the heatmap should be set at the rownames
annot_col <- data.frame(row.names = name_gender_matrix[, 1], Gender = name_gender_matrix[, 2])
# Align the column name of your matrix and with the annotation
colnames(matrix) <- rownames(annot_col)
heatmap <- pheatmap(
mat = matrix,
cluster_rows = F,
cluster_cols = F,
cellwidth = 30,
cellheight = 30,
annotation_col = annot_col
)

With the given data, you could achieve your desired output like this:
Gender <- sapply(colnames(matrix), function(x) strsplit(x, "_")[[1]][2])
df <- as.data.frame(Gender)
pheatmap(
mat = matrix,
cluster_rows = F,
cluster_cols = F,
cellwidth = 30,
cellheight = 30,
annotation_col = df,
annotation_colors = list(Gender = c(M = "#6ef88a", F = "#d357fe"))
)

Related

Mahalanobis difference by group with dplyr

I want to get a Mahalanobis difference for each set of two scores, after being grouped by another variable. In this case, it would be a Mahalanobis difference for each Attribute (across each set of 2 scores). The output should be 3 Mahalanobis distances (one for A, B and C).
Currently I am working with (in my original dataframe, there are some NAs, hence I include one in the reprex):
library(tidyverse)
library(purrr)
df <- tibble(Attribute = unlist(map(LETTERS[1:3], rep, 5)),
Score1 = c(runif(7), NA, runif(7)),
Score2 = runif(15))
mah_db <- df %>%
dplyr::group_by(Attribute) %>%
dplyr::summarise(MAH = mahalanobis(Score1:Score2,
center = base::colMeans(Score1:Score2),
cov(Score1:Score2, use = "pairwise.complete.obs")))
This raises the error:
Caused by error in base::colMeans(): ! 'x' must be an array of at
least two dimensions
But as far as I can tell, I am giving colMeans two columns.
So what's going wrong here? And I wonder if even fixing this gives a complete solution?
It seems your question is more about the statistics than dplyr. So I just give a small example based on your data and an adapted example from ?mahalanobis. Perhaps also have a look here or here.
df <- subset(x = df0, Attribute == "A", select = c("Score1", "Score2"))
df$mahalanobis <- mahalanobis(x = df, center = colMeans(df), cov = cov(df))
df$p <- pchisq(q = df$mahalanobis, df = 2, lower.tail = FALSE)
plot(density(df$mahalanobis, bw = 0.3), ylim = c(0, 0.8),
main="Squared Mahalanobis distances");
grid()
rug(df$mahalanobis)
df <- subset(x = df0, Attribute == "B", select = c("Score1", "Score2"))
df <- df[complete.cases(df), ]
df$mahalanobis <- mahalanobis(x = df, center = colMeans(df), cov = cov(df))
df$p <- pchisq(q = df$mahalanobis, df = 2, lower.tail = FALSE)
lines(density(df$mahalanobis, bw = 0.3), col = "red",
main="Squared Mahalanobis distances");
rug(df$mahalanobis, col = "red")
df <- subset(x = df0, Attribute == "C", select = c("Score1", "Score2"))
df$mahalanobis <- mahalanobis(x = df, center = colMeans(df), cov = cov(df))
df$p <- pchisq(q = df$mahalanobis, df = 2, lower.tail = FALSE)
lines(density(df$mahalanobis, bw = 0.3), col = "green",
main="Squared Mahalanobis distances");
rug(df$mahalanobis, col = "green")
Hope, that helps (and too long for a comment).
(Of course you can make to code much shorter, but it shows in each step what happens.)

Perform an operation with complete cases without changing the original vectors

I would like to calculate a rank-biserial correlation. But the (only it seems) package can't handle missing values that well. It has no built in "na.omit = TRUE" function. I could remove the missings in the data frame, but that would be a hustle with many different calculations.
n <- 500
df <- data.frame(id = seq (1:n),
ord = sample(c(0:3), n, rep = TRUE),
sex = sample(c("m", "f"), n, rep = TRUE, prob = c(0.55, 0.45))
)
df <- as.data.frame(apply (df, 2, function(x) {x[sample( c(1:n), floor(n/10))] <- NA; x} ))
library(rcompanion)
wilcoxonRG(x = df$ord, g = df$sex, verbose = T)
I imagine something stupidly easy like "complete.cases(wilcoxonRG(x = df$ord, g = df$sex, verbose = T)). It's probably not that hard but I could only find comeplete data frame manipulations. Thanks in advance!

How to get z-score distribution for 3 dataframes and plot all in one graph

I'm trying to create a single graph that contains boxplots of gene expression for 3 different variant types (synonymous, missense, and nonsense). Currently, these variant types are separated into 3 different data frames, each of which contain a Gene, SampleID, and Expression column.
In order to plot all 3 boxplots on a single graph, I need to normalize all the expression data for each variant type, which means I need to get the z-scores. My question is, how do I do that and then how do I plot all 3 variant types on one graph?
I've come across the solution:
missense$Zscore <- ave(m$expr, m$Gene, FUN = scale)
nonsense$Zscore <- ave(n$expr, n$Gene, FUN = scale)
synonymous$Zscore <- ave(s$expr, s$Gene, FUN = scale)
Is this the right approach? If so, where do I go from here?
Example dataframe (missense):
SampleID Expression Gene
HSB100 5.239237 ENSG00000188976
HSB105 4.443808 ENSG00000188976
HSB104 4.425764 ENSG00000188976
HSB121 4.063259 ENSG00000188976
Use scale function to get Z-scores.
missense <- data.frame(SampleID = c('HSB100', 'HSB105', 'HSB104', 'HSB121'),
Expression = c(5.239237, 4.443808, 4.425764, 4.063259),
Gene = c('ENSG00000188976', 'ENSG00000188976', 'ENSG00000188976', 'ENSG00000188976'))
missense$Zscore <- scale(missense$Expression)
missense
mean(missense$Zscore)
sd(missense$Zscore)
# Create fake data here
nonsense <-
data.frame(SampleID = c('HSB100', 'HSB105', 'HSB104', 'HSB121'),
Expression = c(1, 2, 3, 4),
Gene = c('ENSG00000188976', 'ENSG00000188976', 'ENSG00000188976', 'ENSG00000188976'))
nonsense$Zscore <- scale(nonsense$Expression)
synonymous <-
data.frame(SampleID = c('HSB100', 'HSB105', 'HSB104', 'HSB121'),
Expression = c(3, 4, 5, 6),
Gene = c('ENSG00000188976', 'ENSG00000188976', 'ENSG00000188976', 'ENSG00000188976'))
synonymous$Zscore <- scale(synonymous$Expression)
The trick is to bind all three data frames together and then plot using ggplot. Not familiar with base plot but this is what I would do:
# Add identifyer
missense$Type <- 'missense'
nonsense$Type <- 'nonsense'
synonymous$Type <- 'synonymous'
# Bind three together
data_all <- rbind(missense, nonsense, synonymous)
# Use ggplot to plot boxscores
library(ggplot2)
ggplot(data = data_all, aes(x = Type, y = Zscore)) + geom_boxplot()
If all the genes are the same in each corresponding data frame, then ave is not needed since no multiple groupings exist. Hence, you can run a simple calculation: m$Zscore <- scale(m$expr). From there as #emilliman5 comments, graph all three vectors with a list and even name x-axis with a named list:
# WITH SEABORN COLORS
boxplot(list(missense=m$Zscore, nonsense=n$Zscore, synonymous=s$Zscore),
col = c("#4c72b0","#55a868","#c44e52"))
Even consider row binding all data frames but adding a new column for a variant_type indicator. Then use ave since now genes will differ within data frame. And even use formula style instead of list() for boxplot:
all_gene_df <- rbind(transform(m, variant_type='missense'),
transform(n, variant_type='nonsense'),
transform(s, variant_type='synonymous'))
all_gene_df$Zscore <- with(all_gene_df, ave(expr, variant_type, FUN = scale))
# WITH SEABORN COLORS
boxplot(Zscore ~ variant_type, data = all_gene_df,
col = c("#4c72b0","#55a868","#c44e52"),
main = "ZScore Boxplots by Gene",
xlab = "Genes",
ylab = "ZScore")
Data
set.seed(103018)
m <- data.frame(SampleID = paste0(sample(LETTERS, 50, replace=TRUE), sample(LETTERS, 50, replace=TRUE),
sample(LETTERS, 50, replace=TRUE), sample(100:999, 50, replace=TRUE)),
expr = runif(50)*10,
gene = 'MISSENSE0001')
n <- data.frame(SampleID = paste0(sample(LETTERS, 50, replace=TRUE), sample(LETTERS, 50, replace=TRUE),
sample(LETTERS, 50, replace=TRUE), sample(100:999, 50, replace=TRUE)),
expr = runif(50)*10,
gene = 'NONSENSE0001')
s <- data.frame(SampleID = paste0(sample(LETTERS, 50, replace=TRUE), sample(LETTERS, 50, replace=TRUE),
sample(LETTERS, 50, replace=TRUE), sample(100:999, 50, replace=TRUE)),
expr = runif(50)*10,
gene = 'SYNONYMOUS0001')

How to color branches in R dendogram as a function of the classes in it?

I wish to visualize how well a clustering algorithm is doing (with certain distance metric). I have samples and their corresponding classes.
To visualize, I cluster and I wish to color the branches of a dendrogram by the items in the cluster. The color will be the color most items in the hierarchical cluster correspond to (given by the data\classes).
Example: If my clustering algorithm chose indexes 1,21,24 to be a certain cluster (at a certain level) and I have a csv file containing a class number in each row corresponding to lets say 1,2,1. I want this edge to be coloured 1.
Example Code:
require(cluster)
suppressPackageStartupMessages(library(dendextend))
dir <- 'distance_metrics/'
filename <- 'aligned.csv'
my.data <- read.csv(paste(dir, filename, sep=""), header = T, row.names = 1)
my.dist <- as.dist(my.data)
real.clusters <-read.csv("clusters", header = T, row.names = 1)
clustered <- diana(my.dist)
# dend <- colour_branches(???dend, max(real.clusters)???)
plot(dend)
EDIT:
another example partial code
dir <- 'distance_metrics/' # csv in here contains a symmetric matrix
clust.dir <- "clusters/" #csv in here contains a column vector with classes
my.data <- read.csv(paste(dir, filename, sep=""), header = T, row.names = 1)
filename <- 'table.csv'
my.dist <- as.dist(my.data)
real.clusters <-read.csv(paste(clust.dir, filename, sep=""), header = T, row.names = 1)
clustered <- diana(my.dist)
dnd <- as.dendrogram(clustered)
Both node and edge color attributes can be set recursively on "dendrogram" objects (which are just deeply nested lists) using dendrapply. The cluster package also features an as.dendrogram method for "diana" class objects, so conversion between the object types is seamless. Using your diana clustering and borrowing some code from #Edvardoss iris example, you can create the colored dendrogram as follows:
library(cluster)
set.seed(999)
iris2 <- iris[sample(x = 1:150,size = 50,replace = F),]
clust <- diana(iris2)
dnd <- as.dendrogram(clust)
## Duplicate rownames aren't allowed, so we need to set the "labels"
## attributes recursively. We also label inner nodes here.
rectify_labels <- function(node, df){
newlab <- df$Species[unlist(node, use.names = FALSE)]
attr(node, "label") <- (newlab)
return(node)
}
dnd <- dendrapply(dnd, rectify_labels, df = iris2)
## Create a color palette as a data.frame with one row for each spp
uniqspp <- as.character(unique(iris$Species))
colormap <- data.frame(Species = uniqspp, color = rainbow(n = length(uniqspp)))
colormap[, 2] <- c("red", "blue", "green")
colormap
## Now color the inner dendrogram edges
color_dendro <- function(node, colormap){
if(is.leaf(node)){
nodecol <- colormap$color[match(attr(node, "label"), colormap$Species)]
attr(node, "nodePar") <- list(pch = NA, lab.col = nodecol)
attr(node, "edgePar") <- list(col = nodecol)
}else{
spp <- attr(node, "label")
dominantspp <- levels(spp)[which.max(tabulate(spp))]
edgecol <- colormap$color[match(dominantspp, colormap$Species)]
attr(node, "edgePar") <- list(col = edgecol)
}
return(node)
}
dnd <- dendrapply(dnd, color_dendro, colormap = colormap)
## Plot the dendrogram
plot(dnd)
The function you are looking for is color_brances from the dendextend R package, using the arguments clusters and col. Here is an example (based on Shaun Wilkinson's example):
library(cluster)
set.seed(999)
iris2 <- iris[sample(x = 1:150,size = 50,replace = F),]
clust <- diana(iris2)
dend <- as.dendrogram(clust)
temp_col <- c("red", "blue", "green")[as.numeric(iris2$Species)]
temp_col <- temp_col[order.dendrogram(dend)]
temp_col <- factor(temp_col, unique(temp_col))
library(dendextend)
dend %>% color_branches(clusters = as.numeric(temp_col), col = levels(temp_col)) %>%
set("labels_colors", as.character(temp_col)) %>%
plot
there are suspicions that misunderstood the question however I'll try to answer:
from my previous objectives were rewritten by the example of iris
clrs <- rainbow(n = 3) # create palette
clrs <- clrs[iris$Species] # assign colors
plot(x = iris$Sepal.Length,y = iris$Sepal.Width,col=clrs) # simple test colors
# cluster
dt <- cbind(iris,clrs)
dt <- dt[sample(x = 1:150,size = 50,replace = F),] # create short dataset for visualization convenience
empty.labl <- gsub("."," ",dt$Species) # create a space vector with length of names intended for reserve place to future text labels
dst <- dist(x = scale(dt[,1:4]),method = "manhattan")
hcl <- hclust(d = dst,method = "complete")
plot(hcl,hang=-1,cex=1,labels = empty.labl, xlab = NA,sub=NA)
dt <- dt[hcl$order,] # sort rows for order objects in dendrogramm
text(x = seq(nrow(dt)), y=-.5,labels = dt$Species,srt=90,cex=.8,xpd=NA,adj=c(1,0.7),col=as.character(dt$clrs))

Is dplyr's left_join correct way to attach a data.frame to a SpatialPolygonDataFrame in R?

Merging extra data (frames) to spatial objects in R can be tricky (as explained here, or here)
Searching for a solution on how to correctly do the job I found this SO question listing several methods. dplyr's left_join was not listed there. I spotted it being used in Robin's tutorial.
My question is - is this a correc method to use? Are there any use cases (different number of rows? different rows names? sorting? etc.) that this solution would fail?
Here is some reproducible code illustarting the methods I found / came across:
# libraries
library("spdep"); library("sp"); library("dplyr")
# sopatial data
c <- readShapePoly(system.file("etc/shapes/columbus.shp", package="spdep")[1])
m <- c#data
c#data <- subset(c#data, select = c("POLYID", "INC"))
c#data$INC2 <- c#data$INC
c#data$INC <- NULL
ex <- subset(c, c$POLYID <= 2) # polygons with messed up data in merged df
c <- subset(c, c$POLYID < 49) # remove one polygon from shape so that df has one poly too many
# messing up merge data
m <- subset(m, POLYID != 1) # exclude polygon
m <- subset(m, select = c("POLYID", "INC")) # only two vars
rownames(m) <- m$POLYID - 2 # change rownames
m$POLYID[m$POLYID == 2] <- 0 # wrong ID
m <- m[order(m$INC),] # different sort
m$POLYID2 <- m$POLYID # duplicated to check dplyr
# left_join solution
s1 <- c
s1#data <- left_join(s1#data, m)
plot(c)
plot(s1, col = "red", density = 40, angle = 0, add = TRUE)
plot(ex, col= NA, border = "green", add = TRUE)
View(s1#data)
# match solution
s2 <- c
s2#data = data.frame(s2#data, m[match(s2#data[,"POLYID"], m[,"POLYID"]),])
plot(c)
plot(s2, col = "red", density = 40, angle = 0, add = TRUE)
plot(ex, col= NA, border = "green", add = TRUE)
View(s2#data)
# sp solution
s3 <- c
s3 <- sp::merge(s3, m, by="POLYID")
plot(c)
plot(s3, col = "red", density = 40, angle = 0, add = TRUE)
plot(ex, col= NA, border = "green", add = TRUE)
View(s3#data)
# inner join solution
s4 <- c
s4#data <- inner_join(s4#data, m)
plot(c)
plot(s4, col = "red", density = 40, angle = 0, add = TRUE)
plot(ex, col= NA, border = "green", add = TRUE)
View(s4#data)
# rebuild solution???
s5 <- c
s5.df <- as(s5, "data.frame")
s5.df1 <- merge(s5.df, m, sort=FALSE, by.x="POLYID", by.y="POLYID", all.x=TRUE, all.y=TRUE)
s51 <- SpatialPolygonsDataFrame(as(s5, "SpatialPolygons"), data=s5.df1)
plot(c)
plot(s51, col = "red", density = 40, angle = 0, add = TRUE)
plot(ex, col= NA, border = "green", add = TRUE)
Left join seems to do the job. Same as sp::merge and match ( I do hope there is no messing up the order so for instance plotted polygons are associated with different vales after the merge?). None of the solutions actually removes two polygons withmissing data, but I presume this is correct behaviour in R?

Resources