Plot sample-vs-sample gene expression levels in R - r

I have a data set containing gene expression data for various genes, across 24 different samples. In my current dataframe, each row is a gene and each column is a sample.
I want to create a dot plot where each dot is a gene, the y-axis represents the expression of that gene in sample A, and the x-axis represents the expression of the same gene in sample B.
I have tried to search for this but don't know what such a plot is called or how I can find it. Most of my other plots are plotted with ggplot2, but it does not matter what package is used to solve the problem.
Example data:
sample_A<-c(2,3,1)
sample_B<-c(-1,4,-3)
genes <- c("gene1","gene2","gene3")
df<-data.frame(sample_A,sample_B,row.names = genes)
Data frame:
sample_A sample_B
gene1 2 -1
gene2 3 4
gene3 1 -3

What you describe is a scatter plot, and geom_point in ggplot2 is probably what you're looking for. The dots can also be labelled using geom_label.
library(ggplot2)
# one point per gene: sample_B on the x-axis, sample_A on the y-axis
p <- ggplot(df, aes(x = sample_B, y = sample_A))+
geom_point()+
geom_label(aes(label = rownames(df)))
p # print the plot
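Since the question says any package will do, here is a minimal base-R sketch of the same scatter plot, using the example data from the question. The dashed y = x reference line is an addition of mine, not part of the answer above; genes above it are higher in sample A, genes below it are higher in sample B.

```r
# Example data from the question
sample_A <- c(2, 3, 1)
sample_B <- c(-1, 4, -3)
genes <- c("gene1", "gene2", "gene3")
df <- data.frame(sample_A, sample_B, row.names = genes)

# One point per gene: sample_B on the x-axis, sample_A on the y-axis
plot(df$sample_B, df$sample_A,
     xlab = "sample_B", ylab = "sample_A", pch = 19)

# Label each point with its gene name, drawn just above the point
text(df$sample_B, df$sample_A, labels = rownames(df), pos = 3)

# Illustrative addition: dashed y = x line (equal expression in both samples)
abline(a = 0, b = 1, lty = 2)
```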

Related

Creating a boxplot showing the spread of gene-expression within different samples

So I have a dataframe containing the 10 most upregulated genes in cancer-samples compared to control-samples and the 10 most downregulated genes.
It looks like this:
I want to create a neat boxplot to compare the spread of each gene between patient-samples and control-samples (there are 4 samples of each type).
The problem I have is that I don't get all the boxes alongside each other in a single row of the same graph, but like this:
I would also like the genes' boxes to be sorted by their log2FC value rather than in alphabetical order. Does anyone know how to fix this?
This is my code I used:
#Im using the dataframe called "Dataframe"
#Make a column for the genenames
Dataframe$genenames <- rownames(Dataframe)
#Get data into a long-format
long_Dataframe <- gather(Dataframe, key="samples",
value="values", -c(log2FC, genenames))
#Creating a new column called "group", stating if each row belongs to patient/control
long_Dataframe$group <- rep(c("Control", "Patient"), each=40)
#Order rows by log2FC - from lowest to highest
long_Dataframe <- long_Dataframe[order(long_Dataframe$log2FC), ,
drop=FALSE]
#Use long data for boxplot of top 20 up/downregulated genes
Boxplot_top20 <- ggplot(long_Dataframe, aes(x=genenames, y=values, fill=group)) +
geom_boxplot() +
scale_fill_manual(values=c("green", "red")) +
theme_light() +
facet_wrap(~genenames, scales="free")
You can drop facet_wrap() and use geom_boxplot(position=position_dodge()) instead; this places the control and patient boxes side by side for each gene on a single shared axis.
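The sorting part of the question is not covered above. Ordering the rows of the data frame does not change the axis order, because ggplot2 orders a discrete axis by factor levels. A minimal sketch, assuming a long data frame with the columns genenames, group, values and log2FC; the toy data below is made up and only stands in for the asker's long_Dataframe:

```r
library(ggplot2)

# Hypothetical toy data standing in for the asker's long_Dataframe
set.seed(1)
toy <- expand.grid(genenames = c("geneA", "geneB", "geneC"),
                   group = c("Control", "Patient"),
                   replicate = 1:4)
fc <- c(geneA = 2.1, geneB = -1.5, geneC = 0.3)   # made-up log2FC values
toy$log2FC <- unname(fc[as.character(toy$genenames)])
toy$values <- rnorm(nrow(toy), mean = 5)

# Reorder the factor levels by log2FC (lowest first); this, not the row
# order, is what determines the order of the genes on the x-axis
toy$genenames <- reorder(toy$genenames, toy$log2FC)

# No facet_wrap(): all genes share one panel, boxes dodged by group
p <- ggplot(toy, aes(x = genenames, y = values, fill = group)) +
  geom_boxplot(position = position_dodge()) +
  scale_fill_manual(values = c("green", "red")) +
  theme_light()
p
```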

Colouring a PCA plot by clusters in R

I have some biological data that looks like this, with 2 different types of clusters (A and B):
Cluster_ID A1 A2 A3 B1 B2 B3
5 chr5:100947454..100947489,+ 3.31322 7.52365 3.67255 21.15730 8.732710 17.42640
12 chr5:101227760..101227782,+ 1.48223 3.76182 5.11534 15.71680 4.426170 13.43560
29 chr5:102236093..102236457,+ 15.60700 10.38260 12.46040 6.85094 15.551400 7.18341
I clean up the data:
CAGE<-read.table("CAGE_expression_matrix.txt", header=T)
CAGE_data <- as.data.frame(CAGE)
#Remove clusters with 0 expression for all 6 samples
CAGE_filter <- CAGE[rowSums(abs(CAGE[,2:7]))>0,]
#Filter whole file to keep only clusters with at least 5 TPM in at least 3 files
CAGE_filter_more <- CAGE_filter[apply(CAGE_filter[,2:7] >= 5,1,sum) >= 3,]
CAGE_data <- as.data.frame(CAGE_filter_more)
The data size is reduced from 6981 clusters to 599 after this.
I then go on to apply PCA:
#Get data dimensions
dim(CAGE_data)
PCA.CAGE<-prcomp(CAGE_data[,2:7], scale.=TRUE)
summary(PCA.CAGE)
I want to create a PCA plot of the data, marking each sample and coloring the samples depending on their type (A or B.) So it should be two colors for the plot with text labels for each sample.
This is what I have tried, to erroneous results:
qplot(PC1, PC2, colour = CAGE_data, geom=c("point"), label=CAGE_data, data=as.data.frame(PCA.CAGE$x))
ggplot(data=PCA.CAGE, aes(x=PCA1, y=PCA2, colour=CAGE_filter_more, label=CAGE_filter_more)) + geom_point() + geom_text()
qplot(PCA.CAGE[1:3], PCA.CAGE[4:6], label=colnames(PC1, PC2, PC3), geom=c("point", "text"))
The errors appear as such:
> qplot(PCA.CAGE$x[,1:3],PCA.CAGE$x[4:6,], xlab="Data 1", ylab="Data 2")
Error: Aesthetics must either be length one, or the same length as the dataProblems:PCA.CAGE$x[4:6, ]
> qplot(PC1, PC2, colour = CAGE_data, geom=c("point"), label=CAGE_data, data=as.data.frame(PCA.CAGE$x))
Don't know how to automatically pick scale for object of type data.frame. Defaulting to continuous
Don't know how to automatically pick scale for object of type data.frame. Defaulting to continuous
Error: Aesthetics must either be length one, or the same length as the dataProblems:CAGE_data, CAGE_data
> ggplot(data=PCA.CAGE, aes(x=PCA1, y=PCA2, colour=CAGE_filter_more, label=CAGE_filter_more)) + geom_point() + geom_text()
Error: ggplot2 doesn't know how to deal with data of class
Your question doesn't make sense (to me at least). You seem to have two groups of 3 variables (the A group and the B group). When you run PCA on these 6 variables, you'll get 6 principal components, each of which is a (different) linear combination of all 6 variables. Clustering is based on the cases (rows). If you want to cluster the data based on the first two PCs (a common approach), then you need to do that explicitly. Here's an example using the built-in iris data set.
pca <- prcomp(iris[,1:4], scale.=TRUE)
clust <- kmeans(pca$x[,1:2], centers=3)$cluster
library(ggbiplot)
ggbiplot(pca, groups=factor(clust)) + xlim(-3,3)
So here we run PCA on the first 4 columns of iris. Then, pca$x is a matrix containing the principal components in the columns. Next we run k-means clustering on the first 2 PCs and extract the cluster numbers into clust. Finally we use ggbiplot(...) to make the plot.
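If the goal is instead what the question literally asks for (one labelled point per sample, coloured by type A or B), one possible reading is that the samples rather than the clusters should be the observations. A hedged sketch under that assumption, using a made-up matrix in place of the CAGE data; the names expr, scores and type are hypothetical:

```r
library(ggplot2)

# Hypothetical expression matrix: rows are clusters, columns are samples
set.seed(42)
expr <- matrix(rexp(60, rate = 0.1), nrow = 10,
               dimnames = list(paste0("cluster", 1:10),
                               c("A1", "A2", "A3", "B1", "B2", "B3")))

# Transpose so that each sample becomes one observation (row)
pca <- prcomp(t(expr), scale. = TRUE)

# One row of PC scores per sample, plus its name and type
scores <- as.data.frame(pca$x)
scores$sample <- rownames(scores)
scores$type <- substr(scores$sample, 1, 1)  # "A" or "B", taken from the name

p <- ggplot(scores, aes(x = PC1, y = PC2, colour = type, label = sample)) +
  geom_point() +
  geom_text(vjust = -1)
p
```

This also avoids the errors shown in the question: ggplot() receives a data frame rather than the prcomp object, and the colour and label aesthetics are columns with one value per row.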

Plotting gene expression data with means in a randomized experiment

I'm (a newbie to R) analyzing a randomized study on the effect of two treatments on gene expression. We evaluated 5 different genes at baseline and after 1 year. The gene fold is calculated as the value at 1 year divided by the baseline value.
Example gene:
IL10_BL
IL10_1Y
IL10_fold
Gene expression is measured as a continuous variable, typically ranging from 0.1 to 5.0.
100 patients have been randomized to either a statin or diet regime.
I would like to do the following plot:
- Y axis should display the mean gene expression with 95% confidence limit
- X axis should be categorical, with the baseline, 1 year and fold value for each of the 5 genes, grouped by treatment. So, 5 genes with 3 values for each gene in two groups would mean 30 categories on the X axis. It would be really nice if the dots for the same gene were connected with a line.
I have tried to do this myself (using ggplot2) without any success. I've tried to do it directly from the crude data, which looks like this (first 6 observations and 2 different genes):
genes <- read.table(header=TRUE, sep=";", text =
"treatment;IL10_BL;IL10_1Y;IL10_fold;IL6_BL;IL6_1Y;IL6_fold;
diet;1.1;1.5;1.4;1.4;1.4;1.1;
statin;2.5;3.3;1.3;2.7;3.1;1.1;
statin;3.2;4.0;1.3;1.5;1.6;1.1;
diet;3.8;4.4;1.2;3.0;2.9;0.9;
statin;1.1;3.1;2.8;1.0;1.0;1.0;
diet;3.0;6.0;2.0;2.0;1.0;0.5;")
I would greatly appreciate any help (or link to a similar thread) to do this.
First, you need to melt your data into a long format, so that one column (your X column) contains a categorical variable indicating whether an observation is BL, 1Y, or fold.
(your command creates an empty column you might need to get rid of first: genes$X = NULL)
library(reshape2)
genes.long = melt(genes, id.vars='treatment', value.name='expression')
Then you need the gene and measurement (baseline, 1-year, fold) in different columns (from this question).
genes.long$gene = as.character(lapply(strsplit(as.character(genes.long$variable), split='_'), '[', 1))
genes.long$measurement = as.character(lapply(strsplit(as.character(genes.long$variable), split='_'), '[', 2))
And put the measurement in the order that you expect:
genes.long$measurement = factor(genes.long$measurement, levels=c('BL', '1Y', 'fold'))
Then you can plot using stat_summary() calls for the mean and confidence intervals. Use facets to separate the groups (treatment and gene combinations).
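library(ggplot2)
# fun replaces the fun.y argument, which is deprecated since ggplot2 3.3.0
# mean_cl_boot needs the Hmisc package to be installed
ggplot(genes.long, aes(measurement, expression)) +
stat_summary(fun = mean, geom='point') +
stat_summary(fun.data = 'mean_cl_boot', geom='errorbar', width=.25) +
facet_grid(.~treatment+gene)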
You can reverse the order to facet_grid(.~gene+treatment) if you want the top level to be gene instead of treatment.
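Note that mean_cl_boot draws bootstrap confidence limits. If a classical t-based 95% interval is preferred, mean_cl_normal can be passed to fun.data instead. As a rough sketch of what such an interval amounts to, assuming approximately normal data, here it is computed by hand; mean_ci is a hypothetical helper name, and the input is the IL10_BL column for the three diet patients in the example data:

```r
# t-based 95% confidence interval for the mean of a numeric vector
mean_ci <- function(x, level = 0.95) {
  n <- length(x)
  # half-width: t quantile times the standard error of the mean
  half <- qt((1 + level) / 2, df = n - 1) * sd(x) / sqrt(n)
  c(mean = mean(x), lower = mean(x) - half, upper = mean(x) + half)
}

# IL10_BL values of the diet group in the example data
il10_bl_diet <- c(1.1, 3.8, 3.0)
ci <- mean_ci(il10_bl_diet)
ci
```

With only three observations the interval is very wide, which is worth keeping in mind when reading the error bars for small groups.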

Color Dependent Bar Graph in R

I'm a bit out of my depth with this one here. I have the following code that generates two equally sized matrices:
MAX<-100
m<-5
n<-40
success<-matrix(runif(m*n,0,1),m,n)
samples<-floor(MAX*matrix(runif(m*n),m))+1
The success matrix holds the probability of success, and the samples matrix holds the corresponding number of samples observed in each case. I'd like to make a bar graph that groups each column together, with the height of each bar determined by the success matrix. The color of each bar should be scaled from 1 to MAX according to the number of observations (e.g., small samples would be more red, whereas large samples would be green).
Any ideas?
Here is an example with ggplot. First, get data into long format with melt:
library(reshape2)
data.long <- cbind(melt(success), melt(samples)[3])
names(data.long) <- c("group", "x", "success", "count")
head(data.long)
# group x success count
# 1 1 1 0.48513473 8
# 2 2 1 0.56583802 58
# 3 3 1 0.34541582 40
# 4 4 1 0.55829073 64
# 5 5 1 0.06455401 37
# 6 1 2 0.88928606 78
Note melt will iterate through the row/column combinations of both matrices the same way, so we can just cbind the resulting molten data frames. The [3] after the second melt is so we don't end up with repeated group and x values (we only need the counts from the second melt). Now let ggplot do its thing:
library(ggplot2)
ggplot(data.long, aes(x=x, y=success, group=group, fill=count)) +
geom_bar(position="stack", stat="identity") +
scale_fill_gradient2(
low="red", mid="yellow", high="green",
midpoint=mean(data.long$count)
)
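The cbind step in the answer above relies on both matrices being unrolled in the same order. As a sketch, the same long format can be built in base R without reshape2, which makes that order explicit: as.vector() walks a matrix column by column, and row() and col() give the matching indices. The small dimensions here are chosen only for illustration.

```r
set.seed(1)
MAX <- 100
m <- 2
n <- 3
success <- matrix(runif(m * n, 0, 1), m, n)
samples <- floor(MAX * matrix(runif(m * n), m)) + 1

# Column-major unrolling: the row index varies fastest, then the column
data.long <- data.frame(
  group   = as.vector(row(success)),
  x       = as.vector(col(success)),
  success = as.vector(success),
  count   = as.vector(samples)
)
data.long
```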
Using @BrodieG's data.long, this plot might be a little easier to interpret.
library(ggplot2)
library(RColorBrewer) # for brewer.pal(...)
ggplot(data.long) +
geom_bar(aes(x=x, y=success, fill=count),colour="grey70",stat="identity")+
scale_fill_gradientn(colours=brewer.pal(9,"RdYlGn")) +
facet_grid(group~.)
Note that actual values are probably different because you use random numbers in your sample. In future, consider using set.seed(n) to generate reproducible random samples.
Edit [Response to OP's comment]
You get numbers for x-axis and facet labels because you start with matrices instead of data.frames. So convert success and samples to data.frames, set the column names to whatever your test names are, and prepend a group column with the "list of factors". Converting to long format is a little different now because the first column has the group names.
library(reshape2)
set.seed(1)
success <- data.frame(matrix(runif(m*n,0,1),m,n))
success <- cbind(group=rep(paste("Factor",1:nrow(success),sep=".")),success)
samples <- data.frame(floor(MAX*matrix(runif(m*n),m))+1)
samples <- cbind(group=success$group,samples)
data.long <- cbind(melt(success,id=1), melt(samples, id=1)[3])
names(data.long) <- c("group", "x", "success", "count")
One way to set a threshold color is to add a column to data.long and use that for fill:
threshold <- 25
data.long$fill <- with(data.long,ifelse(count>threshold,max(count),count))
Putting it all together:
library(ggplot2)
library(RColorBrewer)
ggplot(data.long) +
geom_bar(aes(x=x, y=success, fill=fill),colour="grey70",stat="identity")+
scale_fill_gradientn(colours=brewer.pal(9,"RdYlGn")) +
facet_grid(group~.)+
theme(axis.text.x=element_text(angle=-90,hjust=0,vjust=0.4))
Finally, when you have names for the x-axis labels they tend to get jammed together, so I rotated the names -90°.

Adding arbitrary labels to each group in a grouped scatterplot in ggplot2

I have a list of matrices I wish to plot. Each element in the list ultimately represents a facet to be plotted. Each matrix element has dimensions Row * Col, where all values in a row are grouped and plotted as a scatterplot (i.e. X-axis is categorical for the row names, Y-axis is the value, Col. values per row).
Additionally, I would like to add the CV for the distribution of points at each X.
X1 X2 value L1
a subject1 8.026494 facet1
b subject1 7.845277 facet1
c subject1 8.189731 facet1
(10 categorical groupings - a-j)
a subject2 5.148875 facet1
b subject2 8.023356 facet1
(33 subjects plotted for each categorical grouping)
a subject1 5.148875 facet2
b subject1 8.023356 facet2
(multiple facets (in my specific case, 50) with identical categorical grouping and subject names)
I managed to plot this to my satisfaction with the following:
p <- (qplot(X1, value, data=melt(df), colour=X2)
+ facet_wrap(~Probeset, ncol=10, nrow=5, scales="free_x"))
However, I would like to add the CV of each grouping of points along the X-axis as a label hovering above the group. I tried variations on this:
p + geom_text(aes(x=X1, y="arbitrary value at the top of the Y-axis scale", label="vector of labels"))
But none of them behaved as I wish. How would I go about getting the CV of each group above the group of points, as a label?
Thank you in advance!
Since there is no X2 corresponding to each label, you have to put the labels into a separate data set, and supply it in the data argument of geom_text. Using a reproducible example:
library(ggplot2)
library(plyr) # for ddply() and .()
#create data with the desired structure
dd <- expand.grid(facet=LETTERS[1:4], group=letters[1:5], subject=factor(1:10))
dd$value <- exp(rnorm(nrow(dd)))
#calculate CV's
ddcv <- ddply(dd, .(facet,group),
function(x)c(CV=sd(x$value)/mean(x$value), maxX=max(x$value)))
ddcv$CV <- round(ddcv$CV,1)
#make plots
p <- qplot(group, value, colour=subject, data=dd) + facet_wrap(~facet)
p + geom_text(aes(x=group, y=maxX+1, label=CV), colour="black", data=ddcv)
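If you would rather not depend on plyr, the per-group summary can also be sketched with base R's aggregate(). This is an alternative I am suggesting, not what the answer above uses; it produces the same CV and maxX columns, so the geom_text() call can stay unchanged.

```r
# Same simulated structure as in the answer above
set.seed(1)
dd <- expand.grid(facet = LETTERS[1:4], group = letters[1:5],
                  subject = factor(1:10))
dd$value <- exp(rnorm(nrow(dd)))

# CV and maximum of value for every facet/group combination
ddcv <- aggregate(value ~ facet + group, data = dd,
                  FUN = function(v) c(CV = sd(v) / mean(v), maxX = max(v)))

# aggregate() returns the two statistics as a matrix column; flatten it
ddcv <- do.call(data.frame, ddcv)
names(ddcv) <- c("facet", "group", "CV", "maxX")
ddcv$CV <- round(ddcv$CV, 1)
head(ddcv)
```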
