I am trying to generate a heatmap like the following figure. I have already tried pheatmap, and my code is as follows:
breaks_2 <- seq(0, 2, by = 0.1)
pheatmap::pheatmap(
mat = data,
cluster_cols = F,
cluster_rows = F,
scale = "column",
border_color = "white",
color = inferno(20),
show_colnames = TRUE,
show_rownames = FALSE,
breaks = breaks_2
)
But this does not seem to work. As far as I understand, I am making a mistake in defining the breaks, or I need to use a package other than pheatmap. Any suggestion would be really helpful.
The color scale in pheatmap adjusts to the range of the input data. If you want anything above a certain value to be coloured daffodil, then simply send pheatmap a copy of your data with the highest values capped at 2.
Suppose you have a data frame like this, with values anywhere between 0 and 3:
set.seed(1)
data <- as.data.frame(matrix(runif(64, 0, 3), nrow = 8))
names(data) <- LETTERS[1:8]
data
#> A B C D E F G H
#> 1 0.7965260 1.8873421 2.1528555 0.801662 1.4806239 2.46283888 2.1969412 0.9488151
#> 2 1.1163717 0.1853588 2.9757183 1.158342 0.5586528 1.94118058 2.0781947 1.5559028
#> 3 1.7185601 0.6179237 1.1401055 0.040171 2.4821200 2.34879829 1.4328589 1.9860152
#> 4 2.7246234 0.5296703 2.3323357 1.147164 2.0054002 1.65910893 2.5836284 1.2204906
#> 5 0.6050458 2.0610685 2.8041157 2.609073 2.3827196 1.58915874 1.3142913 2.7386278
#> 6 2.6951691 1.1523112 0.6364276 1.021047 0.3238309 2.36806870 0.7343918 0.8808101
#> 7 2.8340258 2.3095243 1.9550213 1.446240 2.1711328 0.06999361 0.2120371 1.3771972
#> 8 1.9823934 1.4930977 0.3766653 1.798697 1.2338233 1.43169020 0.2983985 0.9971840
Some of the values are greater than two. We want all of these to appear the same colour on our heatmap, so we create a copy of our data for plotting, and cap all of the values that were greater than 2 at exactly 2:
data_2 <- data
data_2[] <- lapply(data_2, function(x) { x[x > 2] <- 2; x })
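Equivalently, as a one-line sketch (assuming every column of data is numeric), the same capping can be done with base R's pmin:
# cap everything above 2 at exactly 2
data_2 <- as.data.frame(pmin(as.matrix(data), 2))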
So now if we run pheatmap on data_2, we see that all the values that were greater than 2 in our original data frame are coloured daffodil.
library(viridis)
library(pheatmap)
breaks_2 <- seq(0, 2, by = 0.1)
pheatmap(
mat = data_2,
cluster_cols = F,
cluster_rows = F,
border_color = "white",
scale = 'none',
color = inferno(22),
show_colnames = TRUE,
show_rownames = FALSE,
legend_breaks = breaks_2
)
I have data that look like this:
Gene     HBEC-KT-01   HBEC-KT-02   HBEC-KT-03   HBEC-KT-04   HBEC-KT-05   Primarycells-02  Primarycells-03  Primarycells-04  Primarycells-05
BPIFB1   15726000000  15294000000  15294000000  14741000000  22427000000  87308000000      2.00E+11         1.04E+11         1.51E+11
LCN2     18040000000  26444000000  28869000000  30337000000  10966000000  62388000000      54007000000      56797000000      38414000000
C3       2.52E+11     2.26E+11     1.80E+11     1.80E+11     1.78E+11     46480000000      1.16E+11         69398000000      78766000000
MUC5AC   15647000     8353200      12617000     12221000     29908000     40893000000      79830000000      28130000000      69147000000
MUC5B    965190000    693910000    779970000    716110000    1479700000   38979000000      90175000000      41764000000      50535000000
ANXA2    14705000000  18721000000  21592000000  18904000000  22657000000  28163000000      24282000000      21708000000      16528000000
I want to make a heatmap like the following using R. I am following a paper, which states: "Heat maps were generated with the ‘pheatmap’ package76, where correlation clustering distance row was applied". Here is their heatmap.
I want mine to look the same, and I am trying to make one by following tutorials, but I am new to R and know very little about it.
Here is my code.
df <- read.delim("R.txt", header=T, row.names="Gene")
df_matrix <- data.matrix(df)
pheatmap(df_matrix,
main = "Heatmap of Extracellular Genes",
color = colorRampPalette(rev(brewer.pal(n = 10, name = "RdYlBu")))(10),
cluster_cols = FALSE,
show_rownames = F,
fontsize_col = 10,
cellwidth = 40
)
This is what I get.
When I try using clustering, I get the following error:
pheatmap(
mat = df_matrix,
scale = "row",
cluster_column = F,
show_rownames = TRUE,
drop_levels = TRUE,
fontsize = 5,
clustering_method = "complete",
main = "Hierachical Cluster Analysis"
)
Error in hclust(d, method = method) :
NA/NaN/Inf in foreign function call (arg 10)
Can someone help me with the code?
You can normalize the data using scale to achieve a more uniform coloring. Here, the mean expression is set to 0 for each sample, and genes expressed lower than average get a negative z-score:
library(tidyverse)
library(pheatmap)
data <- tribble(
~Gene, ~`HBEC-KT-01`, ~`HBEC-KT-02`, ~`HBEC-KT-03`, ~`HBEC-KT-04`, ~`HBEC-KT-05`, ~`Primarycells-03`, ~`Primarycells-04`, ~`Primarycells-05`,
"BPIFB1", 1.5726e+10, 1.5294e+10, 1.5294e+10, 1.4741e+10, 2.2427e+10, 2e+11, 1.04e+11, 1.51e+11,
"LCN2", 1.804e+10, 2.6444e+10, 2.8869e+10, 3.0337e+10, 1.0966e+10, 5.4007e+10, 5.6797e+10, 3.8414e+10,
"C3", 2.52e+11, 2.26e+11, 1.8e+11, 1.8e+11, 1.78e+11, 1.16e+11, 6.9398e+10, 7.8766e+10,
"MUC5AC", 15647000, 8353200, 12617000, 12221000, 29908000, 7.983e+10, 2.813e+10, 6.9147e+10,
"MUC5B", 965190000, 693910000, 779970000, 716110000, 1479700000, 9.0175e+10, 4.1764e+10, 5.0535e+10,
"ANXA2", 1.4705e+10, 1.8721e+10, 2.1592e+10, 1.8904e+10, 2.2657e+10, 2.4282e+10, 2.1708e+10, 1.6528e+10
)
data %>%
mutate(across(where(is.numeric), scale)) %>%
column_to_rownames("Gene") %>%
pheatmap(
scale = "row",
cluster_cols = F,
show_rownames = FALSE,
show_colnames = TRUE,
treeheight_col = 0,
drop_levels = TRUE,
fontsize = 5,
clustering_method = "complete",
main = "Hierachical Cluster Analysis (z-score)",
)
Created on 2021-09-26 by the reprex package (v2.0.1)
I'm trying to make a multipanel figure with networks in the igraph package. I'd like 2 rows, each with 3 networks. I need to be able to save the figure as a PNG and I'd like to label them each A:F in one of the corners. I've tried to do this in a loop but only one network appears in the figures. I need the V(nw)$x<- y and E(nw)$x<- y code in the loop to make my networks come out properly. My networks are in a list().
I've made a small sample of the code I've tried; I would like to avoid doing it without a loop if I can. Thanks in advance.
srs_1nw <- graph("Zachary")
srs_2nw <- graph("Heawood")
srs_3nw <- graph("Folkman")
srs_1c <- cluster_fast_greedy(srs_1nw)
srs_2c <- cluster_fast_greedy(srs_2nw)
srs_3c <- cluster_fast_greedy(srs_3nw)
listofsrs_nws <- list(srs_1nw,srs_2nw,srs_3nw)
listofsrs_cs <- list(srs_1c,srs_2c,srs_3c)
colours <- c("red","blue","green","yellow")
par(mfrow=c(2,3))
for (i in length(listofsrs_nws)) {
c<-listofsrs_cs[[i]]
nw<-listofsrs_nws[[i]]
V(nw)$size <- log(strength(nw))*6 # weighted nodes
E(nw)$arrow.size <- 2 # arrow size
c.colours <- colours[membership(c)]
plot(c, nw, col = c.colours,
mark.col = adjustcolor(colours, alpha.f = 0.4),
mark.border = adjustcolor(colours, alpha.f = 1),
vertex.frame.width = 5, edge.curved = .15)
}
We can use mapply like below
mapply(function(c, nw) {
V(nw)$size <- log(strength(nw)) * 6 # weighted nodes
E(nw)$arrow.size <- 2 # arrow size
c.colours <- colours[membership(c)]
plot(c, nw,
col = c.colours,
mark.col = adjustcolor(colours, alpha.f = 0.4),
mark.border = adjustcolor(colours, alpha.f = 1),
vertex.frame.width = 5, edge.curved = .15
)
}, listofsrs_cs, listofsrs_nws)
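The question also asked about saving the figure as a PNG and labelling the panels A:F. One possible sketch (the file name, size and label placement are assumptions, not part of the original answer) wraps the same mapply call in a png() device and adds a corner label to each panel with mtext():
# open a PNG device and set up the 2 x 3 panel layout
png("networks.png", width = 1800, height = 1200, res = 150)  # hypothetical file name and size
par(mfrow = c(2, 3))
panel_labels <- LETTERS[seq_along(listofsrs_nws)]
mapply(function(c, nw, lab) {
  V(nw)$size <- log(strength(nw)) * 6   # weighted nodes
  E(nw)$arrow.size <- 2                 # arrow size
  c.colours <- colours[membership(c)]
  plot(c, nw,
       col = c.colours,
       mark.col = adjustcolor(colours, alpha.f = 0.4),
       mark.border = adjustcolor(colours, alpha.f = 1),
       vertex.frame.width = 5, edge.curved = .15)
  mtext(lab, side = 3, adj = 0, line = 1, font = 2)  # panel label in the top-left corner
}, listofsrs_cs, listofsrs_nws, panel_labels)
dev.off()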
I made the heatmap using the code below:
library(pheatmap)
library(dplyr)
data = data.frame(matrix(runif(10*10), ncol=10))
data$sample = rep(c("tumour", "normal"), 5)
data$subject.ID = paste('Subject', 1:10)
data = data %>% arrange(sample)
# for row annotation
my_sample_col = data %>% select(sample)
rownames(my_sample_col) = data$subject.ID
# data matrix
mat = as.matrix(data %>% select(-sample, -subject.ID))
rownames(mat) = data$subject.ID
pheatmap(mat,
scale='row',
annotation_row = my_sample_col,
annotation_names_row=F,
cluster_rows = FALSE,
cluster_cols = FALSE,
show_colnames = FALSE,
show_rownames = FALSE)
I want to put a gap between row 5 and row 6, to separate the heatmap according to my row annotation.
In the pheatmap function, the argument gaps_row seems to do the job:
vector of row indices that show where to put gaps into heatmap. Used only if the rows are not clustered.
I'm not sure how to implement that. Can someone help me with this? Thanks a lot.
I would recommend using the ComplexHeatmap package (website; Gu et al., 2016). You can install it with devtools::install_github("jokergoo/ComplexHeatmap").
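ComplexHeatmap is also distributed via Bioconductor, so an alternative installation route (a standard sketch, assuming BiocManager is not yet installed) is:
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("ComplexHeatmap")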
It has more functionality, but you also have to invest more time (e.g., in the row annotation and matrix scaling).
library(ComplexHeatmap)
# Create annotation for rows
my_sample_col_ano <- rowAnnotation(sample = my_sample_col$sample,
show_annotation_name = FALSE)
# Scale original matrix row-wise
matS <- t(apply(mat, 1, scale))
# Plot heatmap
Heatmap(matS,
# Remove name from fill legend
name = "",
# Keep original row/col order
row_order = rownames(matS), column_order = colnames(matS),
# Add left annotation (legend with tumor/normal)
left_annotation = my_sample_col_ano,
# ACTUAL SPLIT by sample group
row_split = my_sample_col$sample,
show_row_names = FALSE, show_column_names = FALSE,
show_row_dend = FALSE, show_column_dend = FALSE,
row_title = NULL)
If you want to use the original pheatmap, pass a value to gaps_row equal to the size of your first group (i.e., normal):
pheatmap(mat,
scale='row',
gaps_row = 5,
annotation_row = my_sample_col,
annotation_names_row=F,
cluster_rows = FALSE,
cluster_cols = FALSE,
show_colnames = FALSE,
show_rownames = FALSE)
If you have more than two groups, then instead of hardcoding a numeric value for gaps_row (i.e., gaps_row = 5) you can pass this snippet: head(as.numeric(cumsum(table(my_sample_col$sample))), -1).
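For example, a small sketch reusing my_sample_col from above (note that table() orders the groups alphabetically, so this only lines up because the rows of mat were sorted with arrange(sample) earlier):
# gap positions are the cumulative group sizes, dropping the last one
# (no gap is needed after the final group)
gap_positions <- head(as.numeric(cumsum(table(my_sample_col$sample))), -1)
pheatmap(mat,
         scale = 'row',
         gaps_row = gap_positions,
         annotation_row = my_sample_col,
         annotation_names_row = F,
         cluster_rows = FALSE,
         cluster_cols = FALSE,
         show_colnames = FALSE,
         show_rownames = FALSE)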
Given a data frame containing mixed variables (i.e. both categorical and continuous) like,
digits = 0:9
# set seed for reproducibility
set.seed(17)
# function to create random string
createRandString <- function(n = 5000) {
a <- do.call(paste0, replicate(5, sample(LETTERS, n, TRUE), FALSE))
paste0(a, sprintf("%04d", sample(9999, n, TRUE)), sample(LETTERS, n, TRUE))
}
df <- data.frame(ID=c(1:10), name=sample(letters[1:10]),
studLoc=sample(createRandString(10)),
finalmark=sample(c(0:100),10),
subj1mark=sample(c(0:100),10),subj2mark=sample(c(0:100),10)
)
I perform unsupervised feature selection using the package FactoMineR
df.princomp <- FactoMineR::FAMD(df, graph = FALSE)
The variable df.princomp is a list.
Thereafter, to visualize the principal components I use
fviz_screeplot() and fviz_contrib() like,
#library(factoextra)
factoextra::fviz_screeplot(df.princomp, addlabels = TRUE,
barfill = "gray", barcolor = "black",
ylim = c(0, 50), xlab = "Principal Component",
ylab = "Percentage of explained variance",
main = "Principal Component (PC) for mixed variables")
factoextra::fviz_contrib(df.princomp, choice = "var",
axes = 1, top = 10, sort.val = c("desc"))
which gives the following Fig1 and Fig2.
Explanation of Fig1: The Fig1 is a scree plot. A Scree Plot is a simple line segment plot that shows the fraction of total variance in the data as explained or represented by each Principal Component (PC). So we can see the first three PCs collectively are responsible for 43.8% of total variance. The question now naturally arises, "What are these variables?". This I have shown in Fig2.
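The percentages behind the scree plot can also be read programmatically; a quick sketch using factoextra::get_eigenvalue, which returns the eigenvalue, the percentage of variance, and the cumulative percentage per dimension:
# eigenvalues and percentage of variance explained per dimension
factoextra::get_eigenvalue(df.princomp)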
Explanation of Fig2: This figure visualizes the contribution of rows/columns from the results of Principal Component Analysis (PCA). From here I can see that the variables name, studLoc and finalmark are the most important variables that can be used for further analysis.
Further analysis, where I'm stuck: to derive the contribution of the aforementioned variables name, studLoc and finalmark, I use the principal component variable df.princomp (see above) like df.princomp$quanti.var$contrib[,4] and df.princomp$quali.var$contrib[,2:3].
I have to manually specify the column indices [,2:3] and [,4].
What I want: I want to know how to do dynamic column index assignment, so that I do not have to manually code the column index [,2:3] in the list df.princomp.
I've already looked at the following similar questions (1, 2, 3 and 4) but cannot find my solution. Any help or suggestions to solve this problem will be helpful.
Not sure if my interpretation of your question is correct; apologies if not. From what I gather, you are using PCA as an initial tool to show you which variables are the most important in explaining the dataset. You then want to go back to your original data, select these variables quickly without manually coding them each time, and use them for some other analysis.
If this is correct, then I have saved the data from the contribution plot, kept the variables that have the greatest contribution, and used that result to create a new data frame with these variables alone.
digits = 0:9
# set seed for reproducibility
set.seed(17)
# function to create random string
createRandString <- function(n = 5000) {
a <- do.call(paste0, replicate(5, sample(LETTERS, n, TRUE), FALSE))
paste0(a, sprintf("%04d", sample(9999, n, TRUE)), sample(LETTERS, n, TRUE))
}
df <- data.frame(ID=c(1:10), name=sample(letters[1:10]),
studLoc=sample(createRandString(10)),
finalmark=sample(c(0:100),10),
subj1mark=sample(c(0:100),10),subj2mark=sample(c(0:100),10)
)
df.princomp <- FactoMineR::FAMD(df, graph = FALSE)
factoextra::fviz_screeplot(df.princomp, addlabels = TRUE,
barfill = "gray", barcolor = "black",
ylim = c(0, 50), xlab = "Principal Component",
ylab = "Percentage of explained variance",
main = "Principal Component (PC) for mixed variables")
#find the top contributing variables to the overall variation in the dataset
#here I am choosing the top 10 variables (although we only have 6 in our df).
#note you can specify which axes you want to look at with axes=, you can even do axes=c(1,2)
f<-factoextra::fviz_contrib(df.princomp, choice = "var",
axes = c(1), top = 10, sort.val = c("desc"))
#save data from contribution plot
dat<-f$data
#keep the variables whose contribution is higher than, say, 20%
r<-rownames(dat[dat$contrib>20,])
#extract these from your original data frame into a new data frame for further analysis
new<-df[r]
new
#finalmark name studLoc
#1 53 b POTYQ0002N
#2 73 i LWMTW1195I
#3 95 d VTUGO1685F
#4 39 f YCGGS5755N
#5 97 c GOSWE3283C
#6 58 g APBQD6181U
#7 67 a VUJOG1460V
#8 64 h YXOGP1897F
#9 15 j NFUOB6042V
#10 81 e QYTHG0783G
Based on your comment, where you said you wanted to 'Find variables with value greater than 5 in Dim.1 AND Dim.2 and save these variables to a new data frame', I would do this:
#top contributors to both Dim 1 and 2
f<-factoextra::fviz_contrib(df.princomp, choice = "var",
axes = c(1,2), top = 10, sort.val = c("desc"))
#save data from contribution plot
dat<-f$data
#keep the variables whose contribution is higher than 5%
r<-rownames(dat[dat$contrib>5,])
#extract these from your original data frame into a new data frame for further analysis
new<-df[r]
new
(This keeps all the original variables in our new data frame since they all contributed more than 5% to the total variance)
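If you need this selection repeatedly, the same logic can be wrapped in a small helper (a hypothetical function sketch, not part of factoextra, built on the fviz_contrib data used above):
# hypothetical helper: return the columns of df whose contribution to the chosen
# axes exceeds a cutoff (in %), using the data behind fviz_contrib
top_contrib_vars <- function(famd_res, df, axes = 1, cutoff = 5) {
  f <- factoextra::fviz_contrib(famd_res, choice = "var", axes = axes, top = Inf)
  dat <- f$data
  keep <- rownames(dat[dat$contrib > cutoff, , drop = FALSE])
  df[keep]
}
# e.g. variables contributing more than 5% to Dim 1 and Dim 2 combined
new <- top_contrib_vars(df.princomp, df, axes = c(1, 2), cutoff = 5)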
There are a lot of ways to extract contributions of individual variables to PCs. For numeric input, one can run a PCA with prcomp and look at $rotation (I spoke too soon and forgot you've got factors here, so prcomp won't work directly). Since you are using factoextra::fviz_contrib, it makes sense to check how that function extracts this information under the hood. Type factoextra::fviz_contrib at the console and read the function:
> factoextra::fviz_contrib
function (X, choice = c("row", "col", "var", "ind", "quanti.var",
"quali.var", "group", "partial.axes"), axes = 1, fill = "steelblue",
color = "steelblue", sort.val = c("desc", "asc", "none"),
top = Inf, xtickslab.rt = 45, ggtheme = theme_minimal(),
...)
{
sort.val <- match.arg(sort.val)
choice = match.arg(choice)
title <- .build_title(choice[1], "Contribution", axes)
dd <- facto_summarize(X, element = choice, result = "contrib",
axes = axes)
contrib <- dd$contrib
names(contrib) <- rownames(dd)
theo_contrib <- 100/length(contrib)
if (length(axes) > 1) {
eig <- get_eigenvalue(X)[axes, 1]
theo_contrib <- sum(theo_contrib * eig)/sum(eig)
}
df <- data.frame(name = factor(names(contrib), levels = names(contrib)),
contrib = contrib)
if (choice == "quanti.var") {
df$Groups <- .get_quanti_var_groups(X)
if (missing(fill))
fill <- "Groups"
if (missing(color))
color <- "Groups"
}
p <- ggpubr::ggbarplot(df, x = "name", y = "contrib", fill = fill,
color = color, sort.val = sort.val, top = top, main = title,
xlab = FALSE, ylab = "Contributions (%)", xtickslab.rt = xtickslab.rt,
ggtheme = ggtheme, sort.by.groups = FALSE, ...) + geom_hline(yintercept = theo_contrib,
linetype = 2, color = "red")
p
}
<environment: namespace:factoextra>
So it's really just calling facto_summarize from the same package. By analogy you can do the same thing, simply call:
> dd <- factoextra::facto_summarize(df.princomp, element = "var", result = "contrib", axes = 1)
> dd
name contrib
ID ID 0.9924561
finalmark finalmark 21.4149175
subj1mark subj1mark 7.1874438
subj2mark subj2mark 16.6831560
name name 26.8610132
studLoc studLoc 26.8610132
And that's the table corresponding to your figure 2. For PC2 use axes = 2 and so on.
Regarding "how to programmatically determine the column indices of the PCs", I'm not 100% sure I understand what you want, but if you just want to say for column "finalmark", grab its contribution to PC3 you can do the following:
library(tidyverse)
# make a tidy table of all column names in the original df with their contributions to all PCs
contribution_df <- map_df(set_names(1:5), ~factoextra::facto_summarize(df.princomp, element = "var", result = "contrib", axes = .x), .id = "PC")
# get the contribution of column 'finalmark' by name
contribution_df %>%
filter(name == "finalmark")
# get the contribution of column 'finalmark' to PC3
contribution_df %>%
filter(name == "finalmark" & PC == 3)
# or, just the numeric value of contribution
filter(contribution_df, name == "finalmark" & PC == 3)$contrib
BTW I think ID in your example is treated as numeric instead of factor, but since it's just an example I'm not bothering with it.
I have the following script:
library("gplots")
mydata <- mtcars
mydata.nr <- nrow(mydata)
mydata.newval <- data.frame(row.names=rownames(mydata),new.val=-log(runif(mydata.nr)))
# Functions
hclustfunc <- function(x) hclust(x, method="complete")
distfunc <- function(x) dist(x,method="euclidean")
# Set colors
hmcols <- rev(redgreen(256));
# Plot the scaled data
heatmap.2(as.matrix(mydata),dendrogram="row",scale="row",col=hmcols,trace="none", margin=c(8,9), hclust=hclustfunc,distfun=distfunc);
Which generates the following heatmap:
Now, given a new data.frame which contains new values for each car:
mydata.nr <- nrow(mydata)
mydata.newval <- data.frame(row.names=rownames(mydata),new.val=-log(runif(mydata.nr)))
I want to create a single-column heatmap with a gray gradient positioned next to the row names.
How can I achieve that in R heatmap.2?
Does this do what you want? You can use the RowSideColors option to add a column to the side of the heatmap.
new.vals = mydata.newval[,1]
mydata.newval$scaled = ( new.vals - min(new.vals) ) /
( max(new.vals) - min(new.vals) )
mydata.newval$gray = gray( mydata.newval$scaled )
heatmap.2( as.matrix(mydata),
dendrogram = "row", scale = "row",
col = hmcols, trace = "none",
margin = c(8,9),
hclust = hclustfunc, distfun = distfunc,
RowSideColors=mydata.newval$gray )
If you want the gray column in between the heatmap and the labels, there isn't a simple way to do that with heatmap.2; I don't think it was designed for such purposes. One way to hack it together would be to make the gray values go from 10 to 11 (or something out of the range of the rest of the data). Then you would change the colors mapped to the breaks (see here). However, this would make your key look pretty funky.
# heatmap.2 does the clustering BEFORE the scaling.
# Clustering after scaling might give different results
# heatmap.2 also reorders the dendrogram according to rowMeans.
# (Code copied directly from the heatmap.2 function)
x = as.matrix(mydata)
Rowv = rowMeans(x, na.rm = TRUE)
hcr = hclustfunc(distfunc(x))
ddr = as.dendrogram(hcr)
ddr = reorder(ddr, Rowv) # the row dendrogram
# Scale the data as heatmap.2 does
rm = rowMeans(x, na.rm = TRUE)
x = sweep(x, 1, rm)
sx = apply(x, 1, sd, na.rm = TRUE)
x = sweep(x, 1, sx, "/")
# add the new data as a column
new.vals = mydata.newval[,1]
new.vals.scaled = ( new.vals - min(new.vals) ) /
( max(new.vals) - min(new.vals) ) # scaled from 0 to 1
x = cbind( x, gray = max(x) + new.vals.scaled + 0.1 )
# make the custom breaks and colors
edge = max(abs(x-1.1))
breaks = seq(-edge,edge+1.1,length.out=1000)
gradient1 = greenred( sum( breaks[-length(breaks)] <= edge ) )
gradient2 = colorpanel( sum( breaks[-length(breaks)] > edge ), "white", "black" )
hm.colors = c(gradient1,gradient2)
hm = heatmap.2( x, col=hm.colors, breaks=breaks,
scale="none",
dendrogram="row", Rowv=ddr,
trace="none", margins=c(8,9) )
Although this hack works, I would look for a more robust solution using more flexible packages that play with different viewports using the grid package.
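As one such alternative, here is a rough sketch with ComplexHeatmap (built on grid, and already used earlier in this thread): the new values become a second one-column heatmap concatenated with +, so the gray column sits between the main heatmap and the row labels. The object names and colour choices below are assumptions, not code from the original answer.
library(ComplexHeatmap)
library(circlize)   # for colorRamp2()
# row-scaled version of the original data (z-scores, as heatmap.2 does with scale = "row")
x_scaled <- t(scale(t(as.matrix(mydata))))
# one-column matrix holding the new values, keeping the row names
new_mat <- as.matrix(mydata.newval["new.val"])
ht_main <- Heatmap(x_scaled, name = "z-score",
                   cluster_columns = FALSE,
                   show_row_names = FALSE)
ht_new <- Heatmap(new_mat, name = "new.val",
                  col = colorRamp2(range(new_mat), c("white", "black")),
                  width = unit(5, "mm"),
                  show_row_names = TRUE)   # labels end up to the right of the gray column
# "+" concatenates heatmaps left to right and keeps the rows aligned
ht_main + ht_new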