Stacked Bar Chart for Gene Expression in R - r

I am new to data visualization. I am trying to use R to visualize DESeq2 data from Galaxy to show the range of gene expression on different chromosomes between males and females of a beetle species. I need the x-axis to be chromosome and the y-axis to be expression level (log2FC).
I would like the plot to show a color gradient going from male to female, like this example.
I have tried to do this with a standard bar chart, creating a joined group for sex and log2FC:
library(ggplot2)
sexlinked <- read.table("clipboard", header = TRUE)
df <- sexlinked
# paste0() takes no sep argument, so the separator is passed as a plain string
df$group <- paste0(df$sex, "-", df$log2fc)
ggplot(df, aes(chromosomes)) +
  geom_bar(aes(fill = group), colour = "grey")
Here is an image of my data (screencap of spreadsheet), in case it helps you to assist me.
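One possible direction, offered as a minimal sketch rather than a tested answer: geom_bar() counts rows by default, so log2FC never reaches the y-axis in the attempt above. Mapping log2fc to both y and fill with geom_col() stacks one segment per gene within each chromosome, and a diverging fill scale gives the gradient. The data frame below is made up; only the column names chromosomes and log2fc are taken from the code in the question. Which sign means male-biased or female-biased depends on how the DESeq2 contrast was set up, so that part is an assumption.

```r
library(ggplot2)

# Made-up stand-in for the DESeq2 table; column names follow the question's code
df <- data.frame(
  chromosomes = rep(c("chr1", "chr2", "chrX"), each = 4),
  log2fc = c(-2.1, -0.4, 0.8, 1.9, -1.2, 0.3, 0.5, 2.4, -3.0, -1.5, -0.2, 0.6)
)

# One segment per gene, stacked within each chromosome; fill follows log2fc.
# Assumption: negative and positive values correspond to the two sexes.
p <- ggplot(df, aes(x = chromosomes, y = log2fc, fill = log2fc)) +
  geom_col(colour = "grey") +
  scale_fill_gradient2(low = "blue", mid = "white", high = "red",
                       midpoint = 0, name = "log2FC") +
  labs(x = "Chromosome", y = "log2 fold change")
p
```

Positive and negative segments stack in opposite directions from zero, so each bar spans the range of log2FC values on that chromosome.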

Related

plotting two categorical vectors in ggridges

I have a dataset with a few organisms, which I would like to plot on the y-axis against date on the x-axis. However, I want the height of each curve to represent the abundance of the organism. That is, I would like to plot a time series of relative abundance, separated by organism, to show similar patterns over time.
Of course, plotting just date against organism does not yield any information on abundance. So my question is: is there a way to make the curve represent abundance using ggridges?
Here is my code for an example dataset:
library(ggplot2)
library(ggridges)
set.seed(1)
Data <- data.frame(
  Abundance = sample(1:100),
  Organism = sample(c("organism1", "organism2"), 100, replace = TRUE)
)
# Ten monthly dates (January to October 2016), repeated ten times to fill 100 rows
Date <- rep(seq(from = as.Date("2016-01-01"), to = as.Date("2016-10-01"), by = "month"), times = 10)
Data <- cbind(Date, Data)
ggplot(Data, aes(x = Abundance, y = Organism)) +
  geom_density_ridges(scale = 1.15, alpha = 0.6, color = "grey90")
This produces a plot with the two organisms, but I want date on the x-axis, not abundance, and I can't get that to work. I have read that you need to specify group=Date or convert the date into Julian day, but neither changes the fact that abundance is not incorporated into the plot.
Does anyone have an example of a plot with date vs. a categorical variable (i.e. organism) plotted against a continuous variable in ggridges?
I really like the output from ggridges and would like to be able to use it for these visualizations. Thank you in advance for your help!
Cheers,
Anni
To use geom_density_ridges, it helps to reshape the data so that each observation sits in its own row, rather than being summarized by an Abundance count.
library(ggplot2)
library(ggridges)
library(dplyr)
# uncount() copies each row "Abundance" number of times
Data_sum <- Data %>%
  tidyr::uncount(Abundance)
ggplot(Data_sum, aes(x = Date, y = Organism)) +
  ggridges::geom_density_ridges(scale = 1, alpha = 0.6, color = "grey90")
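To make the reshaping step concrete, here is a small base-R sketch of what uncount() does in the answer above: each row is repeated as many times as its Abundance value, and the count column is dropped. The toy table and the names counts and expanded are made up for illustration.

```r
# Toy counts table (made-up values): one row per date/organism with a count
counts <- data.frame(
  Date      = as.Date(c("2016-01-01", "2016-02-01", "2016-03-01")),
  Organism  = c("organism1", "organism1", "organism2"),
  Abundance = c(2L, 3L, 1L)
)

# Repeat each row index Abundance times, then keep everything except the count
expanded <- counts[rep(seq_len(nrow(counts)), times = counts$Abundance),
                   c("Date", "Organism")]
rownames(expanded) <- NULL

nrow(expanded)  # one row per counted individual
```

Because a density is estimated separately for each organism, the resulting ridges show how each organism's observations are distributed over time, not absolute abundance.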

How to barplot select rows of data from a dataframe in R?

This is my first time submitting a question, so apologies in advance if my formatting is not optimal.
I have a dataframe with roughly 6,000 rows of data in 2 columns, and I want to be able to pull out individual rows (and multiple rows together) to barplot.
I read my file in as a dataframe, here is a very small subset:
gene log2
1 SMa0002 0.457418
2 SMa0005 1.116950
3 SMa0007 0.686749
4 SMa0009 0.169450
5 SMa0011 0.393365
6 SMa0013 0.601940
What I want is a barplot where the x-axis is a selection of genes (SMaXXX, SMaXXX, SMaXXX, etc.) and the y-axis is the log2 column. The subset shown here only contains positive values, but the full data has negative values as well. I have no real preference about whether I use barplot, geom_bar in ggplot2, or another plotter.
I know how to plot the whole dataframe:
ggplot(df, aes(x = gene, y = log2)) + geom_bar(stat = "identity")
I've tried playing around with using 'match' but I haven't been able to figure out how to make that work. Ideally the code is versatile so I can just punch in different SMaXXXX codes to generate many different plots.
Thanks for reading!
It seems that you just need a way to subset your data.frame when plotting, right?
Let's assume you've got a vector subset.genes of the genes you need to plot:
df <- data.frame(gene = c("SMa0002", "SMa0005", "SMa0006", "SMa0007", "SMa0011", "SMa0013"),
                 log2 = runif(6), stringsAsFactors = FALSE)
subset.genes <- sample(unique(df$gene), 4, replace = FALSE)
A couple of ways:
1°) Inside ggplot2, by restricting the x-axis limits (ggplot2 may warn about the rows it drops):
ggplot(df, aes(x = gene, y = log2)) + geom_bar(stat = "identity") +
  scale_x_discrete(limits = subset.genes)
2°) Before plotting, by subsetting the data frame:
df2 <- subset(df, gene %in% subset.genes)
ggplot(df2, aes(x = gene, y = log2)) + geom_bar(stat = "identity")
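Since the question asks for something reusable and mentions match(), here is a minimal sketch of a helper built on base barplot(). The function name plot_genes is hypothetical and not part of either approach above; it keeps the genes in the order they were requested and warns about any that are not in the data.

```r
# Hypothetical helper (not from the answer above): plot only the requested genes
plot_genes <- function(df, genes) {
  not_found <- setdiff(genes, df$gene)
  if (length(not_found) > 0) {
    warning("Genes not found: ", paste(not_found, collapse = ", "))
  }
  # match() returns row positions in the order the genes were requested
  idx <- match(genes, df$gene)
  sub <- df[idx[!is.na(idx)], ]
  barplot(sub$log2, names.arg = sub$gene, las = 2, ylab = "log2")
  invisible(sub)
}

# The six rows shown in the question
df <- data.frame(
  gene = c("SMa0002", "SMa0005", "SMa0007", "SMa0009", "SMa0011", "SMa0013"),
  log2 = c(0.457418, 1.116950, 0.686749, 0.169450, 0.393365, 0.601940),
  stringsAsFactors = FALSE
)

picked <- plot_genes(df, c("SMa0011", "SMa0002", "SMa0009"))
```

Calling plot_genes() again with a different vector of gene IDs produces a new plot without touching the full data frame.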

ggplot2 adding stacked barchart to heatmap

I would like to add functional information to a HeatMap (geom_tile). I've got the following simplified DataFrame and R code producing a HeatMap and a separate stacked BarPlot (in the right order, corresponding to the HeatMap).
Question:
How can I add the BarPlot to the right side of the HeatMap? It shouldn't overlap with any of the tiles, and the tiles of the BarPlot should align with the tiles of the HeatMap.
Data:
AccessionNumber <- c('A4PU48','A9YWS0','B7FKR5','G4W9I5','B7FGU7','B7FIR4','DY615543_2','G7I6Q7','G7I9C1','G7I9Z0','A4PU48','A9YWS0','B7FKR5','G4W9I5','B7FGU7','B7FIR4','DY615543_2','G7I6Q7','G7I9C1','G7I9Z0','A4PU48','A9YWS0','B7FKR5','G4W9I5','B7FGU7','B7FIR4','DY615543_2','G7I6Q7','G7I9C1','G7I9Z0','A4PU48','A9YWS0','B7FKR5','G4W9I5','B7FGU7','B7FIR4','DY615543_2','G7I6Q7','G7I9C1','G7I9Z0')
Bincode <- c(13,25,29,19,1,1,35,16,4,1,13,25,29,19,1,1,35,16,4,1,13,25,29,19,1,1,35,16,4,1,13,25,29,19,1,1,35,16,4,1)
MMName <- c('amino acid metabolism','C1-metabolism','protein','tetrapyrrole synthesis','PS','PS','not assigned','secondary metabolism','glycolysis','PS','amino acid metabolism','C1-metabolism','protein','tetrapyrrole synthesis','PS','PS','not assigned','secondary metabolism','glycolysis','PS','amino acid metabolism','C1-metabolism','protein','tetrapyrrole synthesis','PS','PS','not assigned','secondary metabolism','glycolysis','PS','amino acid metabolism','C1-metabolism','protein','tetrapyrrole synthesis','PS','PS','not assigned','secondary metabolism','glycolysis','PS')
cluster <- c(1,2,2,2,3,3,4,4,4,4,1,2,2,2,3,3,4,4,4,4,1,2,2,2,3,3,4,4,4,4,1,2,2,2,3,3,4,4,4,4)
variable <- c('rd2c_24','rd2c_24','rd2c_24','rd2c_24','rd2c_24','rd2c_24','rd2c_24','rd2c_24','rd2c_24','rd2c_24','rd2c_48','rd2c_48','rd2c_48','rd2c_48','rd2c_48','rd2c_48','rd2c_48','rd2c_48','rd2c_48','rd2c_48','rd2c_72','rd2c_72','rd2c_72','rd2c_72','rd2c_72','rd2c_72','rd2c_72','rd2c_72','rd2c_72','rd2c_72','rd2c_96','rd2c_96','rd2c_96','rd2c_96','rd2c_96','rd2c_96','rd2c_96','rd2c_96','rd2c_96','rd2c_96')
value <- c(2.15724042939,1.48366099919,1.29388509992,1.59969471112,1.82681962192,2.13347487296,1.08298157478,1.20709456306,1.02011775131,0.88018823632,1.41435923375,1.31680079684,1.32041325076,1.23402873856,2.04977975574,1.90651971106,0.911615352178,1.05021352328,1.18437303394,1.05620421143,1.02132613918,1.22080237755,1.40759491365,1.43131574695,1.65848581311,1.91886008221,0.639581269674,1.11779720968,1.09406554542,1.02259316617,1.00529867534,1.30885290475,1.39376458384,1.35503544429,1.81418617518,1.92505106722,0.862870707741,1.0832577668,1.03118887309,1.21310404226)
df <- data.frame(AccessionNumber, Bincode, MMName, cluster, variable, value)
HeatMap plot:
hm <- ggplot(df, aes(x=variable, y=AccessionNumber))
hm + geom_tile(aes(fill=value), colour = 'white') + scale_fill_gradient2(low='blue', midpoint=1, high='red')
stacked BarPlot:
bp <- ggplot(df, aes(x=sum(df$Bincode), fill=MMName))
bp + stat_bin(aes(ymax = ..count..), binwidth = 1, geom='bar')
Thank you very much for your help/support!!
The variables of the y-axis are sorted first by increasing "cluster" then alphabetically by "AccessionNumber". This is true for both the HeatMap as well as the BarPlot. The values appear in the same order in both plots, but show two different variables (same amount of rows and in the same order, but different content). The HeatMap displays a continuous variable in contrast to the BarPlot which displays a categorical variable. Therefore, the plots could be combined, displaying additional information.
Please help!
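One way this could be approached, as a sketch under stated assumptions rather than a confirmed solution: draw the functional categories as a one-column geom_tile strip that shares the heatmap's y-axis ordering, then place the two plots side by side. The example assumes the patchwork package is installed, which the question does not use, and it runs on a small made-up subset shaped like the question's df. The names acc_order, annot, strip and combined are introduced here for illustration.

```r
library(ggplot2)
library(patchwork)  # assumption: patchwork is installed; the question does not use it

# Small made-up subset in the same shape as the question's df
df <- data.frame(
  AccessionNumber = rep(c("A4PU48", "A9YWS0", "B7FGU7", "G7I9C1"), times = 2),
  MMName   = rep(c("amino acid metabolism", "C1-metabolism", "PS", "glycolysis"), times = 2),
  cluster  = rep(c(1, 2, 3, 4), times = 2),
  variable = rep(c("rd2c_24", "rd2c_48"), each = 4),
  value    = c(2.16, 1.48, 1.83, 1.02, 1.41, 1.32, 2.05, 1.18)
)

# Fix the row order once (cluster, then accession) so both plots share it
acc_order <- unique(df$AccessionNumber[order(df$cluster, df$AccessionNumber)])
df$AccessionNumber <- factor(df$AccessionNumber, levels = acc_order)

hm <- ggplot(df, aes(x = variable, y = AccessionNumber)) +
  geom_tile(aes(fill = value), colour = "white") +
  scale_fill_gradient2(low = "blue", midpoint = 1, high = "red")

# One annotation tile per accession, coloured by functional category
annot <- unique(df[, c("AccessionNumber", "MMName")])
strip <- ggplot(annot, aes(x = "MMName", y = AccessionNumber, fill = MMName)) +
  geom_tile(colour = "white") +
  labs(x = NULL, y = NULL) +
  theme(axis.text.y = element_blank(), axis.ticks.y = element_blank())

# Heatmap on the left, annotation strip on the right, legends gathered together
combined <- hm + strip + plot_layout(widths = c(4, 1), guides = "collect")
combined
```

Because both plots use the same factor levels on the y-axis, each strip tile lines up with its heatmap row, and the strip sits in its own panel so it cannot overlap the heatmap tiles.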

Ordering in ggplot2 [plotting pvals by BP for each chr]

I'm trying to plot points along the genome: there will be plot points for every chromosome. My data file looks like this:
CHROM BP P DP
1 234567 0.0000555 30
.....
Y 12345678 0.09 14
I'm using ggplot2 to plot P values, coloured by DP, for each chromosome, using the following:
mc.points <- ggplot(sample,aes(x = BP,y = P, colour =DP)) +
geom_point() +
labs(x = "Chromosome",y = "P") +
scale_color_gradient2(low = "green", high = "red")
However, instead of being plotted at each BP in the right chromosomal order, the points are plotted by BP alone, without regard to chromosome number.
Is there a way to sort the data to make this happen (i.e. order by chromosome, then BP)? I've tried to make CHROM and BP factors, but this seems to crash R. In addition, if this is possible, is there a way to label the x-axis ticks with chromosome numbers rather than BP (similar to a Manhattan plot)?
I can provide dummy data if need be, but it is quite long.
Just to provide an update: facet_grid seems to solve my problem, but can I transform the result? It splits the plot into one panel per chromosome, but it doesn't place them on a single x-axis in consecutive order; instead it draws 22 separate plots that all use the same x-axis scale. Any solutions?
Have you tried something like this untested code before the plot:
sample$BP <- factor(sample$BP,
  levels = sample[!duplicated(sample[, "BP"]), "BP"][
    order(sample[!duplicated(sample[, "BP"]), "CHROM"])]
)
It would have been easier, and perhaps more compact, if you had included a suitable sample for testing. In the future you should NOT use the name `sample`, since it is an important R function name.
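For the Manhattan-style layout the question describes, a common technique is to give every point a cumulative position: its BP plus the combined length of all preceding chromosomes. The sketch below is not from the answer above; it uses made-up data, takes each chromosome's largest observed BP as its length (an assumption, since real chromosome lengths are not given), and introduces the names snps, offsets, pos and mids for illustration.

```r
library(ggplot2)

# Toy data standing in for the asker's file (values are made up)
snps <- data.frame(
  CHROM = c("X", "1", "2", "1", "Y", "2"),
  BP    = c(200, 100, 300, 500, 80, 50),
  P     = c(0.001, 0.01, 0.5, 0.2, 0.09, 0.05),
  DP    = c(40, 30, 18, 12, 14, 25),
  stringsAsFactors = FALSE
)

# Put chromosomes in karyotype order rather than alphabetical order
snps$CHROM <- factor(snps$CHROM, levels = c(as.character(1:22), "X", "Y"))
snps <- droplevels(snps[order(snps$CHROM, snps$BP), ])

# Offset of each chromosome = summed lengths (here: max BP) of the preceding ones
chrom_len <- sapply(split(snps$BP, snps$CHROM), max)
offsets <- cumsum(c(0, head(chrom_len, -1)))
names(offsets) <- names(chrom_len)
snps$pos <- snps$BP + unname(offsets[as.character(snps$CHROM)])

# Axis ticks sit at the middle of each chromosome's span
mids <- offsets + chrom_len / 2

p <- ggplot(snps, aes(x = pos, y = P, colour = DP)) +
  geom_point() +
  scale_x_continuous(breaks = mids, labels = names(mids)) +
  scale_colour_gradient(low = "green", high = "red") +
  labs(x = "Chromosome", y = "P")
p
```

This keeps everything on one continuous x-axis, in contrast to facet_grid, which draws a separate panel per chromosome.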

Creating density plots from two different data-frames using ggplot2

My goal is to compare the distribution of various socioeconomic factors, such as income, over multiple years to see how the population of a particular region has evolved over, say, 5 years. The primary data for this comes from the Public Use Microdata Sample. I am using R + ggplot2 as my preferred tool.
When comparing two years worth of data (2005 and 2010) I have two data frames hh2005 and hh2010 with the household data for the two years. The income data for the two years are stored in the variable hincp in both data frames. Using ggplot2 I am going about creating the density plot for individual years as follows (example for 2010):
p1 <- ggplot(data = hh2010, aes(x=hincp))+
geom_density()+
labs(title = "Distribution of income for 2010")+
labs(y="Density")+
labs(x="Household Income")
p1
How do I overlay the 2005 density on this plot? Having passed hh2010 as the plot's data, I am not sure how to bring in a second data frame. Should I be processing the data in a fundamentally different way from the very beginning?
You can pass data arguments to individual geoms, so you should be able to add the second density as a new geom like this:
p1 <- ggplot(data = hh2010, aes(x = hincp)) +
  geom_density() +
  # Change the fill colour to differentiate it
  geom_density(data = hh2005, fill = "purple") +
  labs(title = "Distribution of income for 2005 and 2010") +
  labs(y = "Density") +
  labs(x = "Household Income")
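A small variation on the per-layer data idea, sketched with simulated incomes rather than real PUMS data: putting a constant such as fill = "2005" inside aes() makes ggplot2 draw a legend entry for each layer, which a fixed fill = "purple" does not. The colours and the simulated hh2005 and hh2010 frames below are assumptions for illustration.

```r
library(ggplot2)

# Simulated stand-ins for hh2005 and hh2010 (made-up incomes, not PUMS data)
set.seed(42)
hh2005 <- data.frame(hincp = rlnorm(200, meanlog = 10.8, sdlog = 0.6))
hh2010 <- data.frame(hincp = rlnorm(200, meanlog = 11.0, sdlog = 0.6))

# A constant inside aes() becomes a legend entry for that layer
p <- ggplot(mapping = aes(x = hincp)) +
  geom_density(data = hh2005, aes(fill = "2005"), alpha = 0.5) +
  geom_density(data = hh2010, aes(fill = "2010"), alpha = 0.5) +
  scale_fill_manual(name = "Year",
                    values = c("2005" = "purple", "2010" = "orange")) +
  labs(title = "Distribution of income for 2005 and 2010",
       x = "Household Income", y = "Density")
p
```

The alpha setting keeps both curves visible where they overlap.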
This is how I would approach the problem:
1. Tag each data frame with the variable of interest (in this case, the year)
2. Merge the two data sets
3. Update the 'fill' aesthetic in the ggplot function
For example:
# tag each data frame with the year^
hh2005$year <- as.factor(2005)
hh2010$year <- as.factor(2010)
# merge the two data sets
d <- rbind(hh2005, hh2010)
d$year <- as.factor(d$year)
# update the aesthetic
p1 <- ggplot(data = d, aes(x=hincp, fill=year)) +
geom_density(alpha=.5) +
labs(title = "Distribution of income for 2005 and 2010") +
labs(y="Density") +
labs(x="Household Income")
p1
^ Note, the 'fill' parameter seems to work best when you use a factor, thus I defined the years as such. I also set the transparency of the overlapping density plots with the 'alpha' parameter.
