I am currently generating heatmaps in R using ggplot2. In the code below, I first read the data into a data frame, remove any duplicate rows, convert the timestamp field to a factor, melt the data frame (with 'timestamp' as the id), scale every variable to between 0 and 1, and then plot the heatmap.
In the resulting heatmap, time is plotted on the x axis and each iostat-sda variable (see sample data below) is plotted along the y axis. Note: if you want to try out the R code, you can paste the sample data below into a file called iostat-sda.csv.
However, I really need to be able to cluster the rows within this heatmap. Does anyone know how this can be achieved using ggplot?
Any help would be very much appreciated!
############################## The code
library(ggplot2)
library(reshape2) # melt()
library(plyr)     # ddply()
library(scales)   # rescale()
fileToAnalyse <- read.csv(file="iostat-sda.csv", header=TRUE, sep=",")
fileToAnalyse <- subset(fileToAnalyse, !duplicated(timestamp))
fileToAnalyse[,1]<-factor(fileToAnalyse[,1])
fileToAnalyse.m <- melt(fileToAnalyse, id=1)
fileToAnalyse.s <- ddply(fileToAnalyse.m, .(variable), transform, rescale = rescale(value) ) #scales each variable between 0 and 1
base_size <- 9
ggplot(fileToAnalyse.s, aes(timestamp, variable)) +
  geom_tile(aes(fill = rescale), colour = "black") +
  scale_fill_gradient(low = "black", high = "white") +
  theme_grey(base_size = base_size) +
  labs(x = "Time", y = "", title = "Heatmap") +
  theme(legend.position = "right", axis.text.x = element_blank(), axis.ticks = element_blank()) + # opts()/theme_blank() no longer exist in ggplot2
  scale_y_discrete(expand = c(0, 0)) +
  scale_x_discrete(expand = c(0, 0))
########################## Sample data from iostat-sda.csv
timestamp,DSKRRQM,DSKWRQM,DSKR,DSKW,DSKRMB,DSKWMB,DSKARQS,DSKAQUS,DSKAWAIT,DSKSVCTM,DSKUtil
1319204905,0.33,0.98,10.35,2.37,0.72,0.02,120.00,0.01,0.40,0.31,0.39
1319204906,1.00,4841.00,682.00,489.00,60.09,40.68,176.23,2.91,2.42,0.50,59.00
1319204907,0.00,1600.00,293.00,192.00,32.64,13.89,196.45,5.48,10.76,2.04,99.00
1319204908,0.00,3309.00,1807.00,304.00,217.39,26.82,236.93,4.84,2.41,0.45,96.00
1319204909,0.00,5110.00,93.00,427.00,0.72,43.31,173.43,4.43,8.67,1.90,99.00
1319204910,0.00,6345.00,115.00,496.00,0.96,52.25,178.34,4.00,6.32,1.62,99.00
1319204911,0.00,6793.00,129.00,666.00,1.33,57.22,150.83,4.74,6.16,1.26,100.00
1319204912,0.00,6444.00,115.00,500.00,0.93,53.06,179.77,4.20,6.83,1.58,97.00
1319204913,0.00,1923.00,835.00,215.00,78.45,16.68,185.55,4.81,4.58,0.91,96.00
1319204914,0.00,0.00,788.00,0.00,83.51,0.00,217.04,0.45,0.57,0.25,20.00
1319204915,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00
1319204916,0.00,4.00,2.00,4.00,0.01,0.04,17.67,0.00,0.00,0.00,0.00
1319204917,0.00,8.00,4.00,8.00,0.02,0.09,17.83,0.00,0.00,0.00,0.00
1319204918,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00
1319204919,0.00,2.00,113.00,4.00,11.96,0.03,209.93,0.06,0.51,0.43,5.00
1319204920,0.00,59.00,147.00,54.00,11.15,0.63,120.02,0.04,0.20,0.15,3.00
1319204921,1.00,19.00,57.00,18.00,4.68,0.20,133.47,0.07,0.93,0.67,5.00
There is a nice package called NeatMap which simplifies generating heatmaps in ggplot2. Some of the row clustering methods include Multidimensional Scaling, PCA, or hierarchical clustering. Things to watch out for are:
Data to make.heatmap1 has to be in wide format
Data has to be a matrix, not a dataframe
Assign rownames to the wide-format matrix before generating the plot
I've changed your code slightly to avoid giving a column the same name as a function (here, rescale):
fileToAnalyse.s <- ddply(fileToAnalyse.m, .(variable), transform, rescale.x = rescale(value) ) #scales each variable between 0 and 1
fileToAnalyse.w <- dcast(fileToAnalyse.s, timestamp ~ variable, value.var="rescale.x")
rownames(fileToAnalyse.w) <- as.character(fileToAnalyse.w[, 1])
ggheatmap <- make.heatmap1(as.matrix(fileToAnalyse.w[, -1]), row.method = "complete.linkage", row.metric="euclidean", column.cluster.method ="none", row.labels = rownames(fileToAnalyse.w)) +
scale_fill_gradient(low = "black", high = "white") + labs(x = "Time", y = "", title = "Heatmap")
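If NeatMap is not available, the same idea can be sketched with base R's hclust() plus plain ggplot2: cluster the variables, then set the factor levels of the y-axis variable to the clustered order. This is only a minimal illustration on made-up data; the data frame `wide` and its columns A to D are assumptions standing in for the rescaled iostat variables, and it clusters the variables (the heatmap rows in the question's plot) rather than the timestamps.

```r
library(ggplot2)

# Toy stand-in for the rescaled data: one row per timestamp, one column per variable
set.seed(1)
wide <- data.frame(timestamp = factor(1:6),
                   A = runif(6), B = runif(6), C = runif(6), D = runif(6))

# Cluster the variables (heatmap rows) on their profiles over time
m <- t(as.matrix(wide[, -1]))   # rows = variables, columns = timestamps
row_order <- hclust(dist(m, method = "euclidean"), method = "complete")$order

# Long format; the factor levels of `variable` carry the clustered order
long <- data.frame(
  timestamp = rep(wide$timestamp, times = nrow(m)),
  variable  = factor(rep(rownames(m), each = ncol(m)),
                     levels = rownames(m)[row_order]),
  value     = as.vector(t(m))
)

p <- ggplot(long, aes(timestamp, variable)) +
  geom_tile(aes(fill = value), colour = "black") +
  scale_fill_gradient(low = "black", high = "white") +
  labs(x = "Time", y = "")
p
```

Because ggplot2 draws a discrete axis in factor-level order, reordering the levels is enough to reorder the rows; no dendrogram is drawn in this sketch.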
I am trying to create a density plot for particle size data. My data has multiple density and size readings for each genotype set. Is there a way to specify multiple columns into x and y using ggplot? I tried coding for this but am only getting a blank plot as of now. This is the link to the csv file I used: https://drive.google.com/file/d/11djXTmZliPCGLCZavukjb0TT28HsKMRQ/view?usp=sharing
Thanks!
crop.data6 <- read.csv("barleygt25.csv", header = TRUE)
crop.data6
library(ggplot2)
plot1 = ggplot(data=crop.data6, aes(x=, xend=bq, y=a, yend=bq, color=genotype))
plot1
Your data is in a strange format that doesn't lend itself well to plotting. Effectively, it needs to be transposed then pivoted into long format to make it suitable for plotting:
df <- data.frame(xvals = c(t(crop.data6[1:9, -c(1:2)])),
                 yvals = c(t(crop.data6[10:18, -c(1:2)])),
                 genotype = rep(crop.data6$genotype[1:9], each = 68))
ggplot(df, aes(xvals, yvals, color = genotype)) +
  geom_line(size = 1) +
  scale_color_brewer(palette = "Set1") +
  theme_bw(base_size = 16) +
  labs(x = "value", y = "density")
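To see what the transpose-and-flatten step does, here is a small sketch on invented data. The layout of `toy` (size readings in the first rows, density readings in the following rows, two leading label columns) is an assumption modelled on the answer's indexing, not the actual CSV.

```r
# Hypothetical miniature of the layout: 2 genotypes, 3 readings each
toy <- data.frame(genotype = c("g1", "g2", "g1", "g2"),
                  type     = c("size", "size", "density", "density"),
                  r1 = c(1, 2, 0.1, 0.2),
                  r2 = c(3, 4, 0.3, 0.4),
                  r3 = c(5, 6, 0.5, 0.6))

# t() turns each row into a column; c() then flattens column by column,
# so all readings of one genotype end up next to each other
long <- data.frame(
  xvals    = c(t(toy[1:2, -c(1:2)])),
  yvals    = c(t(toy[3:4, -c(1:2)])),
  genotype = rep(toy$genotype[1:2], each = 3)
)
long
```

The `each =` value in rep() must equal the number of reading columns (3 here, 68 in the answer), otherwise the genotype labels drift out of step with the values.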
I found how to estimate the historical variance decomposition for VAR models in R at the link below:
Historical Variance Error Decompotision Daniel Ryback
Daniel Ryback presents the result in an Excel plot, but I wanted to produce it with ggplot, so I wrote some code to do so. However, the plot I got from ggplot is very different from the one Daniel shows in Excel. I replicated it in Excel and got the same result as Daniel, so there seems to be an error in the way I am preparing the ggplot. Does anyone have a suggestion for reproducing the Excel result?
My code is below:
library(vars)
library(ggplot2)
library(reshape2)
# This code is run after running the code developed by Daniel Ryback in the link above to define the VARhd function
data(Canada)
ab<-VAR(Canada, p = 2, type = "both")
HD <- VARhd(Estimation=ab)
HD[,,1]
ex <- HD[,,1]
ex1 <- as.data.frame(ex) # transforming the HD matrix as data frame #
ex2 <- ex1[3:84,1:4] # dropping the first 2 rows as they are NAs #
colnames(ex2) <- c("Employment", "Productivity", "Real Wages", "Unemployment") # renaming columns #
ex2$Period <- 1:nrow(ex2) # creating an id column #
col_id <- grep("Period", names(ex2)) # setting the new variable as id #
ex3 <- ex2[, c(col_id, (1:ncol(ex2))[-col_id])] # moving id variable to the first column #
molten.ex <- melt(ex3, id = "Period") # melting the data frame #
ggplot(molten.ex, aes(x = Period, y = value, fill = variable)) +
geom_bar(stat = "identity") +
guides(fill = guide_legend(reverse = TRUE))
ggplot version
Excel version
The difference is that ggplot2 stacks the bars in the order of the levels of the variable factor, which is not the order Excel uses. If you reorder the factor before plotting, it will put 'Unemployment' at the bottom and 'Employment' at the top, as in Excel:
molten.ex$variable <- factor(molten.ex$variable, levels = c("Unemployment",
"Real Wages",
"Productivity",
"Employment"))
ggplot(molten.ex, aes(x = Period, y = value, fill = variable)) +
geom_bar(stat = "identity", width = 0.6) +
guides(fill = guide_legend(reverse = TRUE)) +
# Making the R plot look more like excel for comparison...
scale_y_continuous(limits = c(-6,8), breaks = seq(-6,8, by = 2)) +
scale_fill_manual(name = NULL,
values = c(Unemployment = "#FFc000", # yellow
`Real Wages` = "#A4A4A4", # grey
Productivity = "#EC7C30", # orange
Employment = "#5E99CE")) + # blue
theme(rect = element_blank(),
panel.grid.major.y = element_line(colour = "#DADADA"),
legend.position = "bottom",
axis.ticks = element_blank(),
axis.title = element_blank(),
legend.key.size = unit(3, "mm"))
Giving:
To roughly match the excel graph in Daniel Ryback's post:
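As a small illustration of why the level names matter here (a sketch on made-up values, not the VAR output): factor() orders levels as given, and any value that does not match a listed level exactly becomes NA, so the column names and the levels have to be spelled identically.

```r
# Levels are kept in the order they are listed
x <- c("Employment", "Unemployment", "Employment")
f <- factor(x, levels = c("Unemployment", "Employment"))
levels(f)

# A value that matches none of the levels is silently turned into NA
g <- factor("Emplyment", levels = c("Unemployment", "Employment"))
is.na(g)
```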
I have a data.frame that I'm trying to plot in a facetted manner with R's ggplot's geom_boxplot:
set.seed(1)
vals <- rnorm(12)
min.vals <- vals-0.5
low.vals <- vals-0.25
max.vals <- vals+0.5
high.vals <- vals+0.25
df <- data.frame(sample=c("c0.A_1","c0.A_2","c1.A_1","c1.A_2","c2.A_1","c2.A_2","c0.B_1","c0.B_2","c1.B_1","c1.B_2","c2.B_1","c2.B_2"),
replicate=rep(c(1,2),6),val=vals,min.val=min.vals,low.val=low.vals,max.val=max.vals,high.val=high.vals,
group=c(rep("A",6),rep("B",6)),cycle=rep(c("c0","c0","c1","c1","c2","c2"),2),
stringsAsFactors = F)
In this example there are two factors which I'd like to facet:
facet.factors <- c("group","cycle")
for(f in 1:length(facet.factors)) df[,facet.factors[f]] <- factor(df[,facet.factors[f]],levels=unique(df[,facet.factors[f]]))
levels.vec <- sapply(facet.factors,function(f) length(levels(df[,f])))
But in other cases I may have only one or more than two factors.
Is there a way to pass to facet_wrap the vector of factors by which to facet and the number of columns?
Here's what I tried, where in addition I created my own colors for each factor level:
library(RColorBrewer,quietly=T)
library(scales,quietly=T)
level.colors <- brewer.pal(sum(levels.vec),"Set2")
require(ggplot2)
ggplot(df,aes_string(x="replicate",ymin="min.val",lower="low.val",middle="val",upper="high.val",ymax="max.val",col=facet.factors,fill=facet.factors))+
geom_boxplot(position=position_dodge(width=0),alpha=0.5,stat="identity")+
facet_wrap(~facet.factors,ncol=max(levels.vec))+
labs(x="Replicate",y="Val")+
scale_x_continuous(breaks=unique(df$replicate))+
scale_color_manual(values=level.colors,labels=unname(unlist(sapply(facet.factors,function(f) levels(df[,f])))),name="factor level")+scale_fill_manual(values=level.colors,labels=unname(unlist(sapply(facet.factors,function(f) levels(df[,f])))),name="factor level")+
theme_bw()+theme(legend.position="none",panel.border=element_blank(),strip.background=element_blank(),axis.title=element_text(size=8))
which obviously throws this error:
Error in combine_vars(data, params$plot_env, vars, drop = params$drop) :
At least one layer must contain all variables used for facetting
Clearly this works:
ggplot(df,aes_string(x="replicate",ymin="min.val",lower="low.val",middle="val",upper="high.val",ymax="max.val",col=facet.factors,fill=facet.factors))+
geom_boxplot(position=position_dodge(width=0),alpha=0.5,stat="identity")+
facet_wrap(group~cycle,ncol=max(levels.vec))+
labs(x="Replicate",y="Val")+
scale_x_continuous(breaks=unique(df$replicate))+
scale_color_manual(values=level.colors,labels=unname(unlist(sapply(facet.factors,function(f) levels(df[,f])))),name="factor level")+scale_fill_manual(values=level.colors,labels=unname(unlist(sapply(facet.factors,function(f) levels(df[,f])))),name="factor level")+
theme_bw()+theme(legend.position="none",panel.border=element_blank(),strip.background=element_blank(),axis.title=element_text(size=8))
But it ignores the colors I'm passing and doesn't add the legend, I imagine since I cannot pass a vector to col and fill in aesthetics, and clearly I have to hard code the facetting.
This doesn't work either for the facetting problem:
ggplot(df,aes_string(x="replicate",ymin="min.val",lower="low.val",middle="val",upper="high.val",ymax="max.val",col=facet.factors,fill=facet.factors))+
geom_boxplot(position=position_dodge(width=0),alpha=0.5,stat="identity")+
facet_wrap(facet.factors[1]~facet.factors[2],ncol=max(levels.vec))+
labs(x="Replicate",y="Val")+
scale_x_continuous(breaks=unique(df$replicate))+
scale_color_manual(values=level.colors,labels=unname(unlist(sapply(facet.factors,function(f) levels(df[,f])))),name="factor level")+scale_fill_manual(values=level.colors,labels=unname(unlist(sapply(facet.factors,function(f) levels(df[,f])))),name="factor level")+
theme_bw()+theme(legend.position="none",panel.border=element_blank(),strip.background=element_blank(),axis.title=element_text(size=8))
So my questions are:
1. Is there a way to pass a vector to facet_wrap?
2. Is there a way to color and fill by a vector of factors rather by single ones?
We cannot assign two colours or fills to a single box, so I suggest pasting the faceting variables together and using the result as the colour/fill variable:
df$col.fill <- Reduce(paste, df[facet.factors])
The facets argument of facet_wrap accepts either a character vector or a one-sided formula:
facet.formula <- as.formula(paste('~', paste(facet.factors, collapse = '+')))
So the code finally looks like this:
ggplot(df,
aes_string(
x = "replicate", ymin = "min.val", ymax = "max.val",
lower = "low.val", middle = "val", upper = "high.val",
col = "col.fill", fill = "col.fill"
)) +
geom_boxplot(position = position_dodge(width = 0),
alpha = 0.5,
stat = "identity") +
facet_wrap(facet.factors, ncol = max(levels.vec)) +
# alternatively: facet_wrap(facet.formula, ncol = max(levels.vec)) +
labs(x = "Replicate", y = "Val") +
scale_x_continuous(breaks = unique(df$replicate)) +
theme_bw() +
theme(
#legend.position = "none",
panel.border = element_blank(),
strip.background = element_blank(),
axis.title = element_text(size = 8)
)
The legend was not displayed because you set legend.position = "none".
BTW, it would definitely improve readability if you added some spaces and line breaks to your code.
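The two helper steps in this answer can be checked in isolation. The sketch below uses a three-row toy data frame (an assumption, not the question's df) to show what the pasted colour key and the generated formula look like for any number of faceting columns.

```r
facet.factors <- c("group", "cycle")

toy <- data.frame(group = c("A", "A", "B"),
                  cycle = c("c0", "c1", "c0"))

# Paste the faceting columns together row by row into one colour/fill key
toy$col.fill <- Reduce(paste, toy[facet.factors])
toy$col.fill

# Build a one-sided formula such as ~group + cycle from the character vector
facet.formula <- as.formula(paste("~", paste(facet.factors, collapse = " + ")))
facet.formula
```

Both steps depend only on the contents of facet.factors, so the same code should work unchanged with one faceting column or with more than two.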
I have a data.frame that looks something like this:
HSP90AA1 SSH2 ACTB TotalTranscripts
ESC_11_TTCGCCAAATCC 8.053308 12.038484 10.557234 33367.23
ESC_10_TTGAGCTGCACT 9.430003 10.687959 10.437068 30285.41
ESC_11_GCCGCGTTATAA 7.953726 9.918988 10.078192 30133.94
ESC_11_GCATTCTGGCTC 11.184402 11.056144 8.316846 24857.07
ESC_11_GTTACATTTCAC 11.943733 11.004500 9.240883 23629.00
ESC_11_CCGTTGCCCCTC 7.441695 9.774733 7.566619 22792.18
The TotalTranscripts column is sorted in descending order. What I'd like to do is generate three bar graphs using ggplot2, one for each column of the data.frame except TotalTranscripts. I'd like the bars in each graph to be ordered by TotalTranscripts, just as in the data.frame. It would be ideal to have these bar graphs on one plot using a facet wrap.
Any help would be greatly appreciated! Thank you!
EDIT: Here is my current code using barplot().
cells = "ESC"
genes = c("HSP90AA1", "SSH2", "ACTB")
g = data[genes,grep(cells, colnames(data))]
g = data.frame(t(g), colSums(data)[grep(cells, colnames(data))])
colnames(g)[ncol(g)] = "TotalTranscripts"
g = g[order(g$TotalTranscripts, decreasing=T), , drop=F]
barplot(as.matrix(g[1]), beside=TRUE, names.arg=paste(rownames(g)," (",g$TotalTranscripts,")",sep=""), las=2, col="light blue", cex.names=0.3, main=paste(colnames(g)[1], "\nCells sorted by total number of transcripts (colSums)", sep=""))
This will generate a plot that looks like this.
Again, the problem I seem to be having here is how to have multiple of these plots on the same image. I would like to add 20+ columns to this data.frame but I've cut this down to 3 for the sake of simplicity.
EDIT: Current code incorporating the answer below
cells = "ESC"
genes = rownames(data[x,])[1:8]
# genes = c("HSP90AA1", "SSH2", "ACTB")
g = data[genes,grep(cells, colnames(data))]
g = data.frame(t(g), colSums(data)[grep(cells, colnames(data))])
colnames(g)[ncol(g)] = "TotalTranscripts"
g = g[order(g$TotalTranscripts, decreasing=T), , drop=F]
g$rowz <- row.names(g)
g$Cells <- reorder(g$rowz, rev(g$TotalTranscripts))
df1 <- melt(g, id.vars = c("Cells", "TotalTranscripts"), measure.vars=genes)
ggplot(df1, aes(x = Cells, y = value)) + geom_bar(stat = "identity") +
theme(axis.title.x=element_blank(), axis.text.x = element_blank()) +
facet_wrap(~ variable, scales = "free") +
theme_bw() + theme(axis.text.x = element_text(angle = 90))
Here is the example data for anybody else:
df <- structure(list(HSP90AA1 = c(8.053308, 9.430003, 7.953726, 11.184402,
11.943733, 7.441695), SSH2 = c(12.038484, 10.687959, 9.918988,
11.056144, 11.0045, 9.774733), ACTB = c(10.557234, 10.437068,
10.078192, 8.316846, 9.240883, 7.566619), TotalTranscripts = c(33367.23,
30285.41, 30133.94, 24857.07, 23629, 22792.18)), .Names = c("HSP90AA1",
"SSH2", "ACTB", "TotalTranscripts"), class = "data.frame", row.names = c("ESC_11_TTCGCCAAATCC",
"ESC_10_TTGAGCTGCACT", "ESC_11_GCCGCGTTATAA", "ESC_11_GCATTCTGGCTC",
"ESC_11_GTTACATTTCAC", "ESC_11_CCGTTGCCCCTC"))
And here is a solution:
#New column for row names so they can be used as x-axis elements
df$rowz <- row.names(df)
#Explicitly order the rows (see the Kohske link)
df$rowz1 <- reorder(df$rowz, rev(df$TotalTranscripts))
library(reshape2)
#Melt the data from wide to long
df1 <- melt(df, id.vars = c("rowz1", "TotalTranscripts"),
measure.vars = c("HSP90AA1", "SSH2", "ACTB"))
library(ggplot2)
gp <- ggplot(df1, aes(x = rowz1, y = value)) + geom_bar(stat = "identity") +
facet_wrap(~ variable, scales = "free") +
theme_bw()
gp + theme(axis.text.x = element_text(angle = 90))
This example by Kohske is a constant reference for me on ordering elements in ggplot2.
If you have many columns, but the same six ESC complexes, you can switch the groupings, i.e. x = variable and facet_wrap(~ rowz1), but this fundamentally changes how you are visualizing/comparing your data. Also, consider facet_grid(row ~ column) if you can organize the columns by 2 components (Columns being the data that are melted into 'variable' and 'value').
And this additional SO solution isn't related to your question, but it is an elegant way to reorder elements in each facet by their values (for future reference).
Finally, the method that will give you the finest control is to plot each graph separately and combine the grobs. Baptiste's packages like gridExtra and gtable are useful for these tasks.
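For reference, a minimal sketch of how reorder() sets the axis order, on invented values. The solution above uses rev(), which gives the intended order only because the rows are already sorted in descending order; negating the sort key is an alternative (my suggestion, not part of the original answer) that does not depend on the row order.

```r
cells <- c("cell_a", "cell_b", "cell_c")
total <- c(10, 30, 20)

# reorder() sorts the levels by ascending value of the second argument,
# so negating it yields a descending order by `total`
f <- reorder(cells, -total)
levels(f)
```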
**EDIT in response to new information from OP**
The OP has subsequently asked how to visualize the data, especially when there are more ESC categorical variables (up to 600+).
Here are some examples, with the big caveat that with many categorical variables, they should be grouped or converted to a continuous variable somehow.
#Plot colour to a few discrete, categorical variables
gp + aes(fill = rowz1) +
theme(axis.text.x = element_blank(), axis.ticks.x = element_blank()) +
labs(x = NULL, fill = "Cell", title = "Discrete categorical variables")
#Plot colour on a continuous scale.
#Ultimately, not appropriate for this example! (but shown for reference)
#More appropriate: fill = TotalTranscripts
gp + aes(fill = as.numeric(rowz1)) +
theme(axis.text.x = element_blank(), axis.ticks.x = element_blank()) +
labs(x = NULL, title = "Continuous variables (legend won't work for many values)") +
scale_fill_gradient2(name = "Cell",
breaks = as.numeric(df1$rowz1),
labels = df1$rowz1,
midpoint=median(as.numeric(df1$rowz1)))
#x is continuous, colour plotted to the categorical variable.
#Same caveats as earlier.
gp1 <- ggplot(df1, aes(x = TotalTranscripts/1000, y = value, colour = rowz1)) +
geom_point(size=3) + facet_wrap(~ variable, scales = "free") +
labs(title = "X is an actual continuous variable") +
theme_bw() + labs(x = bquote("Total Transcripts,"~10^3), colour = "Cell")
gp1
I have some data here [in a .txt file] which I read into a data frame,
mydf <- read.table("data.txt", header=T,sep="\t")
I melt this data frame next using the following piece of code,
df_mlt <-melt(mydf, id=names(mydf)[1], variable = "cols")
Now I would like to plot this data as a boxplot displaying only values of x>0 , so for this I use the following code,
plt_bx <- ggplot(df_mlt, aes(x=ID1,y=value>0, color=cols))+geom_boxplot()
But the resulting plot looks like the following,
However what I need is to only display positive values of x as individual box plots in the same plot layer. Could someone please suggest what I need to change in the above code to get the proper output, Thanks.
plt_bx <- ggplot(subset(df_mlt, value > 0), aes(x=ID1,y=value, color=cols)) + geom_boxplot()
You need to subset your data frame to remove the undesirable values. Right now you're plotting value > 0, which is either TRUE or FALSE, instead of the boxplot of only the values that are greater than 0.
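The difference between the two calls can be seen without plotting. In this sketch (toy data, with column names borrowed from the question), the comparison yields a logical vector, while subset() keeps the matching rows with their numeric values intact.

```r
toy <- data.frame(ID1 = c("a", "a", "b", "b"),
                  value = c(-1, 2, 0, 5))

# What aes(y = value > 0) actually maps: one TRUE/FALSE per row
is_pos <- toy$value > 0
is_pos

# What the boxplot needs: only the rows with positive values
pos <- subset(toy, value > 0)
pos
```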
Based on @BrodieG's suggestions, the following piece of code yields the plot below:
library(scales) # trans_format(), math_format()
plt_bx <- ggplot(subset(df_mlt, value > 0), aes(x=ID1,y=value,group=ID1)) +
geom_boxplot(aes(color=ID1),outlier.colour="orangered", outlier.size=3) +
scale_y_log10(labels = trans_format("log10", math_format(10^.x))) +
theme_bw() +
theme(legend.text=element_text(size=14), legend.title=element_text(size=14))+
theme(axis.text=element_text(size=26)) +
theme(axis.title=element_text(size=22,face="bold")) +
labs(x = "x", y = "y", colour="Values") +
annotation_logticks(sides = "rl")
plt_bx
I improved my answer: the outlines of the boxplots are drawn in different colours if color in aes is mapped to a factor of the id from the melted data frame, i.e. geom_boxplot(aes(color=factor(ID1))).
The following code results in a plot as below,
plt <- ggplot(subset(df_mlt, value > 0), aes(x=ID1,y=value)) +
geom_boxplot(aes(color=factor(ID1))) +
scale_y_log10(breaks = trans_breaks("log10", function(x) 10^x), labels = trans_format("log10", math_format(10^.x))) +
theme_bw() +
theme(legend.text=element_text(size=14), legend.title=element_text(size=14))+
theme(axis.text=element_text(size=20)) +
theme(axis.title=element_text(size=20,face="bold")) +
labs(x = "x", y = "y",colour="legend" ) +
annotation_logticks(sides = "rl") +
theme(panel.grid.minor = element_blank()) +
guides(colour = guide_legend(title.hjust = 0.5)) +
theme(plot.margin=unit(c(0,1,0,0),"mm"))
plt