Colour regulated genes in Volcano plot - ggplot2 - r

Hi, I'm very new to R and I'm struggling to modify some R code that I found online while learning how to make a volcano plot.
The code makes volcano plots using ggplot2. My problem is that I want to colour the up- and down-regulated proteins, instead of colouring the proteins above the specified threshold. The code I'm using is the following:
install.packages("ggplot2")
gene_list <- read.table("/Users/Javi/Desktop/gene_list.csv", header=T, sep=",")
require(ggplot2)
##Highlight genes that have an absolute fold change > 2 and a p-value < 0.05
gene_list$threshold = as.factor(abs(gene_list$logFC) > 2 & gene_list$P.Value < 0.05)
##Construct the plot object
g = ggplot(data=gene_list, aes(x=logFC, y=-log10(P.Value), colour=threshold)) +
geom_point(alpha=0.4, size=5) +
theme(legend.position = "none") +
xlim(c(-10, 10)) + ylim(c(0, 15)) +
xlab("log2 fold change") + ylab("-log10 p-value")
g
What I would like to do is colour the points with logFC > 1.3 in red (for example) and the points with logFC < -1.3 in blue.
The csv file I'm using is just an example and would be something like this:
logFC P.Value
a 2 0.04
b 5 0.04
c 8 0.04
d 4 0.000005
e 7 0.01
f 1 0.04
g -6 0.0001
h -8 0.04
Thanks very much for your help in advance.
Cheers
Javi

Create a new color flag on your dataframe:
gene_list$color_flag <- ifelse(gene_list$logFC > 1.3, 1, ifelse(gene_list$logFC < -1.3, -1, 0))
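Then add colour = factor(color_flag) to your aes. (The default point shape ignores fill, and wrapping the flag in factor() gives three discrete colours instead of a gradient.)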
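A minimal sketch of the whole approach, using the example values from the question. The column name regulation, the labels "up", "down" and "unchanged", and the grey colour for unchanged points are assumptions made for illustration; only the ±1.3 cut-offs and the red/blue choice come from the question.

```r
library(ggplot2)

# Example data copied from the question
gene_list <- data.frame(
  row.names = letters[1:8],
  logFC     = c(2, 5, 8, 4, 7, 1, -6, -8),
  P.Value   = c(0.04, 0.04, 0.04, 0.000005, 0.01, 0.04, 0.0001, 0.04)
)

# Classify each gene by the direction of its fold change
# ("regulation" is a hypothetical column name)
gene_list$regulation <- factor(
  ifelse(gene_list$logFC > 1.3, "up",
         ifelse(gene_list$logFC < -1.3, "down", "unchanged")),
  levels = c("down", "unchanged", "up")
)

# Map the class to colour and pick the colours explicitly
g <- ggplot(gene_list, aes(x = logFC, y = -log10(P.Value), colour = regulation)) +
  geom_point(alpha = 0.4, size = 5) +
  scale_colour_manual(values = c(down = "blue", unchanged = "grey50", up = "red")) +
  xlim(c(-10, 10)) + ylim(c(0, 15)) +
  xlab("log2 fold change") + ylab("-log10 p-value")
g
```

Naming the colours by factor level keeps the mapping stable even if one class happens to be empty in a given data set.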

Related

Ggplot does not label all interesting peptides for volcano plot

I have a dataframe containing 2479 peptides with their sequence, p-value and log2 fold change.
# A tibble: 6 x 3
Sequence p log2fold
<chr> <dbl> <dbl>
1 FLENEDR 0.343 1.21
2 DTEEEDFHVDQATTVK 0.270 0.771
3 DTEEEDFHVDQATTVK 0.112 1.18
4 SCRASQSVSSSF 0.798 0.139
5 RLSCTTSGF 0.739 0.110
6 SCRASQSVSSSY 0.209 0.375
I'm trying to make a volcano plot that labels the up- and down-regulated peptides. However, for some reason, ggplot only draws 6 labels, and I have no idea why.
I have tried loads of different things: using the up/down regulation in the expression column, raising and lowering my cut-off values to check whether they were the problem, and using ggrepel to spread the labels out. Nothing seems to work. My latest attempt is in the code below.
As a last resort I made a new group containing only the peptides that pass both the significance and fold-change cut-offs, which gives 39 peptides. I then used this subset for the labels and matched peptides between the two dataframes.
Another problem is in my legend: a character has appeared in it since I started using geom_text_repel. I have no idea how or why this is happening.
library(ggplot2)
library(ggrepel)
library(tidyverse)
Volc <- R_volcano
expression <- ifelse(Volc$p < 0.05 & abs (Volc$log2fold) >=1, ifelse(Volc$log2fold>1, 'up', 'down'), 'stable')
Volc <- cbind(Volc, expression)
colnames(Volc)[1] <- 'Sequencenames'
Volc["group"] <- "NotSignificant"
Volc[which(Volc['p'] < 0.05 & abs(Volc['log2fold']) < 1 ),"group"] <- "Significant"
Volc[which(Volc['p'] > 0.05 & abs(Volc['log2fold']) > 1 ),"group"] <- "FoldChange"
Volc[which(Volc['p'] < 0.05 & abs(Volc['log2fold']) > 1 ),"group"] <- "Significant&FoldChange"
VolcFilter <- Volc %>% filter(group=="Significant&FoldChange")
p <- ggplot(data = Volc, aes(x = log2fold, y = -log10(p), colour=expression, label='Sequencenames')) +
geom_point(alpha=0.4, size=2) +
scale_color_manual(values=c("blue", "grey","red"))+
xlim(c(-4.5, 4.5)) +
geom_vline(xintercept=c(-1,1),lty=4,col="black",lwd=0.8) +
geom_hline(yintercept = 1.301,lty=4,col="black",lwd=0.8) +
geom_text_repel(data=head(VolcFilter), aes(label=Sequencenames))+
labs(x="log2(fold change)",
y="-log10 (p-value)",
title="Differential expression") +
theme_bw()+
theme(plot.title = element_text(hjust = 0.5),
legend.position="right",
legend.title = element_blank())
p
Any help is much appreciated. Fairly new to R.
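A likely cause, offered as a sketch rather than a confirmed diagnosis: head() returns only the first six rows by default, so geom_text_repel(data=head(VolcFilter), ...) never sees more than six peptides. The letter in the legend is the key that the text layer draws for itself, and show.legend = FALSE on that layer suppresses it. Because R_volcano is not available, the data below are made up, and the name volcano for the plot object is chosen here only to avoid clashing with the p column.

```r
library(ggplot2)
library(ggrepel)

# Made-up stand-in for R_volcano (hypothetical values)
Volc <- data.frame(
  Sequencenames = paste0("PEP", 1:200),
  p             = rep(c(0.001, 0.5), times = 100),
  log2fold      = seq(-3, 3, length.out = 200)
)

# Direction is decided by the sign, so a fold change of exactly 1 counts as "up"
Volc$expression <- ifelse(Volc$p < 0.05 & abs(Volc$log2fold) >= 1,
                          ifelse(Volc$log2fold > 0, "up", "down"),
                          "stable")

# Keep every regulated peptide for labelling; head() would keep only six
VolcFilter <- subset(Volc, expression != "stable")

volcano <- ggplot(Volc, aes(x = log2fold, y = -log10(p), colour = expression)) +
  geom_point(alpha = 0.4, size = 2) +
  scale_color_manual(values = c(down = "blue", stable = "grey", up = "red")) +
  geom_vline(xintercept = c(-1, 1), lty = 4) +
  geom_hline(yintercept = -log10(0.05), lty = 4) +
  # show.legend = FALSE removes the letter drawn in the legend keys
  geom_text_repel(data = VolcFilter, aes(label = Sequencenames),
                  show.legend = FALSE, max.overlaps = Inf) +
  theme_bw()
volcano
```

With many labels, ggrepel may still drop some unless max.overlaps is raised, which is why it is set to Inf here.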

geom_point points manual scaling

I have some data (named result.df) which looks like the following:
orgaName abundance pVal score
A 3 9.998622e-01 1.795338e-04
B 2 9.999790e-01 1.823428e-05
C 1 2.225074e-308 3.076527e+02
D 1 3.510957e-01 4.545745e-01
and so on...
What I am now plotting is this:
p1 <- ggplot(result.df, aes(log2(abundance), (1-pVal), label=orgaName)) +
ylab("1 - P-Value")+
xlab("log2(abundance)")+
geom_point(aes(size=score))+
ggtitle(colnames(case.count.matrix)[i])+
geom_text(data=subset(result.df, pVal < 0.05),hjust=.65, vjust=-1.2,size=2.5)+
geom_hline(aes(yintercept=.95), colour="blue", linetype="dashed")+
theme_classic()
Everything works and looks fine. However, I would like the point size introduced through
geom_point(aes(size=score))+
to be scaled against fixed values. The legend should be on a base-10 logarithmic scale while the scores themselves stay unchanged, so that low scores nearly disappear and large scores get comparable point sizes across different "result.df" data sets.
EDIT
After reading the comments from @roman and @vrajs5, I was able to produce a plot using the following code:
ggplot(result.df, aes(log2(abundance), (1-pVal), label=orgaName)) +
ylab("1 - P-Value")+
xlab("log2(abundance)")+
geom_point(aes(size=score))+
ggtitle(colnames(case.count.matrix)[i])+
#geom_text(data=subset(result.df, pVal < 0.05 & log2(abundance) > xInt),hjust=.65, vjust=-1.2,size=2.5)+
geom_text(data=subset(result.df, pVal < 0.05),hjust=.65, vjust=-1.2,size=2.5)+
geom_hline(aes(yintercept=.95), colour="blue", linetype="dashed")+
#geom_vline(aes(xintercept=xInt), colour="blue", linetype="dashed")+
#geom_text(data=subset(result.df, pVal > 0.05 & log2(abundance) > xInt),alpha=.5,hjust=.65, vjust=-1.2,size=2)+
#geom_text(data=subset(result.df, pVal < 0.05 & log2(abundance) < xInt),alpha=.5,hjust=.65, vjust=-1.2,size=2)+
theme_classic() +
scale_size(range=c(2,12),expand=c(2,0),breaks=c(0,1,10,100,1000,1000000),labels=c(">=0",">=1",">=10",">=100",">=1000",">=1000000"),guide="legend")
As you can see, the breaks are introduced and labelled as intended. However, the point sizes in the legend do not reflect the point sizes in the plot. Any idea how to fix this?
As @Roman mentioned, if you use scale_size you can specify the limits on size.
The following example shows how to control the size of the points:
result.df = read.table(text = 'orgaName abundance pVal score
A 3 9.998622e-01 1.795338e-04
B 2 9.999790e-01 1.823428e-05
C 1 2.225074e-308 3.076527e+02
D 1 3.510957e-01 4.545745e-01
E 3 2.510957e-01 2.545745e+00
F 3 1.510957e-02 2.006527e+02
G 2 5.510957e-01 3.545745e-02', header = T)
library(ggplot2)
ggplot(result.df, aes(log2(abundance), (1-pVal), label=orgaName)) +
ylab("1 - P-Value")+
xlab("log2(abundance)")+
geom_point(aes(size=score))+
#ggtitle(colnames(case.count.matrix)[i])+
geom_text(data=subset(result.df, pVal < 0.05),hjust=.65, vjust=-1.2,size=2.5)+
geom_hline(aes(yintercept=.95), colour="blue", linetype="dashed")+
theme_classic() +
scale_size(range = c(2,12))
Output graph is
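The example above sets only the size range. To make point sizes comparable between data sets, the limits have to be fixed as well. The sketch below maps log10(score) to size and fixes the limits; the limits c(-5, 3), the breaks, and the variable names size_limits and size_breaks are assumptions picked to cover the example data, not values from the original posts.

```r
library(ggplot2)

result.df <- read.table(text = "orgaName abundance pVal score
A 3 9.998622e-01 1.795338e-04
B 2 9.999790e-01 1.823428e-05
C 1 2.225074e-308 3.076527e+02
D 1 3.510957e-01 4.545745e-01
E 3 2.510957e-01 2.545745e+00
F 3 1.510957e-02 2.006527e+02
G 2 5.510957e-01 3.545745e-02", header = TRUE)

# Fixed limits on the log10 scale (assumed values): reusing the same limits
# for every data set gives the same score the same point size in every plot
size_limits <- c(-5, 3)
size_breaks <- c(-4, -2, 0, 2)

p1 <- ggplot(result.df, aes(log2(abundance), 1 - pVal, label = orgaName)) +
  geom_point(aes(size = log10(score))) +
  geom_text(data = subset(result.df, pVal < 0.05),
            hjust = .65, vjust = -1.2, size = 2.5) +
  geom_hline(yintercept = .95, colour = "blue", linetype = "dashed") +
  scale_size(name = "score", range = c(1, 12), limits = size_limits,
             breaks = size_breaks, labels = paste0("1e", size_breaks)) +
  ylab("1 - P-Value") + xlab("log2(abundance)") +
  theme_classic()
p1
```

Scores whose log10 falls outside the limits are treated as missing and dropped with a warning, so the limits should be chosen wide enough for all data sets being compared.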

Adding multiple expected p-value lines to a QQ-plot in R

I am wondering how I can plot a QQ plot with multiple p-value vectors for different studies in one plot.
I am using the following code to generate a QQ-plot:
install.packages("ggplot2")
library(ggplot2)
The code for qq can be found here: http://gettinggeneticsdone.blogspot.com/2009/11/qq-plots-of-p-values-in-r-using-ggplot2.html
qq(data$Pvals, title="My Quantile-Quantile Plot")
Now I have 4 studies, so 4 Pval vectors. I can plot the first one, Pval1, as:
qq(data$Pval1, title="My Quantile-Quantile Plot")
How can I add labeled lines of observed p-values for the remaining studies? -> Pval2, Pval3, Pval4. Essentially I'd like to display the QQ-plot with 4 observed p-value lines representing the 4 studies in one graph.
Please help!
Thanks!
Can you share what your data looks like? I think the answer you're looking for is to define the group variable in the aes() call. For instance:
UPDATE TO TRANSPOSE DATA SET
# install.packages('ggplot2') # only needs to be installed first time
# install.packages('reshape2') # only needs to be installed first time
library(ggplot2)
library(reshape2)
# fakeData
# RowNum Pval1 Pval2 Pval3 Pval4
# 1 0.5 0.5 0.5 0.5
# 2 0.5 0.5 0.5 0.5
# 3 0.5 0.5 0.5 0.5
#
# melt(fakeData, id.vars = 'RowNum')
# RowNum variable value
# 1 Pval1 0.5
# 1 Pval2 0.5
# 1 Pval3 0.5
ORIGINAL CODE
df <- data.frame(Group = rep(c('A', 'B', 'C', 'D'), 50),
Number = sample(1:100, 200, replace = T))
ggplot(df, aes(sample = Number, group = Group, color = Group)) +
geom_point(stat = 'qq')
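For p-values specifically, a sketch along the same lines: reshape the four columns to long format, compute the expected quantiles within each study, and draw one line per study. The data are simulated, the column names p, study, observed and expected are assumptions, and base R stack() is used here in place of melt() only to keep the example free of extra packages.

```r
library(ggplot2)

# Simulated p-values for four studies (hypothetical data)
set.seed(1)
pvals <- data.frame(Pval1 = runif(100), Pval2 = runif(100),
                    Pval3 = runif(100), Pval4 = runif(100))

# Long format: one row per p-value plus a column naming the study
long <- stack(pvals)
names(long) <- c("p", "study")

# Observed values, and the expected uniform quantile matching each rank
long$observed <- -log10(long$p)
long$expected <- ave(long$p, long$study, FUN = function(p) {
  -log10(ppoints(length(p)))[rank(p, ties.method = "first")]
})

qq_plot <- ggplot(long, aes(x = expected, y = observed, colour = study)) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
  geom_line() +
  labs(x = "Expected -log10(p)", y = "Observed -log10(p)",
       title = "My Quantile-Quantile Plot")
qq_plot
```

The dashed diagonal is the line the observed values follow when the p-values are uniformly distributed.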

R programming - creating a graph, with variable colors

I do not have much experience in R, and I wonder if you can help me with this situation.
I have the following matrix:
mat <- matrix(c(0,0.5,0.2,0.23,0.6,0,0,0.4,
0.56,0.37,0,0.32,0.4,0.99,0.54,0.6,0,0.39), ncol=6, nrow=3)
dimnames(mat) = list(
c("y1","y2","y3"),
c("day1","day2","day3","day4","day5","day6")
)
> mat
day1 day2 day3 day4 day5 day6
y1 0.0 0.23 0.00 0.37 0.40 0.60
y2 0.5 0.60 0.40 0.00 0.99 0.00
y3 0.2 0.00 0.56 0.32 0.54 0.39
>
I want to know how I can get a graph where points are marked based on the matrix.
The values are arbitrary in the interval [0,1]. Is it possible to change the colour of the generated points according to a set of constraints?
Example:
(0,0.2] - Red
(0.2,0.4] - Green
(0.4,0.6] - Yellow
(0.6,0.9] - Blue
(0.9,1] - Black
I apologize if I have not explained myself well.
Thank you!
Assuming that your range for yellow should be [0.4,0.6] (otherwise you haven't specified a colour for (0.4,0.5) - you need to even if your data doesn't require it)
image(mat,col=c("red","green","yellow","blue","black"),breaks=c(0,0.2,0.4,0.6,0.9,1))
I've ignored the interval endpoint issue.
If you just want coloured points, something like this will do it:
palette(c("red","green","yellow","blue","black"))
plot.default(
as.data.frame.table(t(mat))[1:2],
col=findInterval(t(mat),c(0,0.2,0.4,0.6,0.9)),
pch=19,
axes=FALSE,ann=FALSE,
panel.first=grid()
)
axis(2,at=1:length(rownames(mat)),labels=rownames(mat),lwd=0,lwd.ticks=1,las=1)
axis(1,at=1:length(colnames(mat)),labels=colnames(mat),lwd=0,lwd.ticks=1)
box()
palette("default")
Result:
To assign colours to the different intervals, you can break up your values into groups using cut. Like others have said, it's a bit unclear what to do with points on the boundaries, so I've set include.lowest to TRUE, which keeps the zeros in the first interval instead of turning them into NA:
library(reshape2)
df = melt(mat)
colnames(df)[1:2] = c('year', 'day')
df$value_groups = cut(df$value, breaks=c(0,0.2,0.4,0.6,0.9,1), include.lowest=TRUE)
library(ggplot2)
ggplot(df, aes(x=day, y=value, colour=value_groups, shape=year)) +
geom_point(size=3)
Result:
Here is how I would do it using lattice:
library(reshape2)
library(lattice)
mmat <- melt(mat) # reshaping the data
# note that zero is only included in the first interval because include.lowest = TRUE
mmat$colors <- cut(mmat$value, breaks=seq(0, 1, 0.2), include.lowest=TRUE) # stealing from Marius
xyplot(value ~ Var2 | Var1, mmat, groups = colors,
par.settings = list(superpose.symbol =
list(col = c('red', 'green', 'yellow', 'blue', 'black'))))
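To tie the intervals produced by cut() to the exact colours listed in the question, the interval labels can be used as names for a manual colour scale. This is a sketch under the assumption that zero should fall in the red band; the names long, series, band and band_colours are made up for the example.

```r
library(ggplot2)

mat <- matrix(c(0, 0.5, 0.2, 0.23, 0.6, 0, 0, 0.4,
                0.56, 0.37, 0, 0.32, 0.4, 0.99, 0.54, 0.6, 0, 0.39),
              ncol = 6, nrow = 3,
              dimnames = list(c("y1", "y2", "y3"), paste0("day", 1:6)))

# Long format with base R: one row per matrix cell
long <- as.data.frame(as.table(mat))
names(long) <- c("series", "day", "value")

# include.lowest = TRUE closes the first interval so that 0 is not NA
long$band <- cut(long$value, breaks = c(0, 0.2, 0.4, 0.6, 0.9, 1),
                 include.lowest = TRUE)

# One colour per interval, in the order given in the question
band_colours <- setNames(c("red", "green", "yellow", "blue", "black"),
                         levels(long$band))

ggplot(long, aes(x = day, y = series, colour = band)) +
  geom_point(size = 5) +
  scale_colour_manual(values = band_colours, drop = FALSE)
```

drop = FALSE keeps all five bands in the legend even when a band contains no points.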

Troubles with ggplot and geom_bar

Here the updated example:
df <- data.frame(a=rep(c("A","B"),each=10),
b=rep(rep(c("C","D"),each=5),2),
c=c(sample(letters[1:5]), sample(letters[6:10]),
sample(letters[1:5]), sample(letters[6:10])),
d=c(0.10,0.18,0.34,0.35,0.59,0.16,0.38,0.40,0.53,0.58,
0.37,0.62,0.83,1.46,-0.91,-0.79,-0.52,-0.43,-0.01,0.34))
> df
a b c d
1 A C b 0.10
2 A C e 0.18
3 A C a 0.34
4 A C c 0.35
5 A C d 0.59
6 A D i 0.16
7 A D j 0.38
8 A D h 0.40
9 A D f 0.53
10 A D g 0.58
11 B C e 0.37
12 B C d 0.62
13 B C a 0.83
14 B C c 1.46
15 B C b -0.91
16 B D f -0.79
17 B D i -0.52
18 B D h -0.43
19 B D j -0.01
20 B D g 0.34
If you look closely, you will see that column d is meant to be ordered from smallest to largest within each level of column b.
The first plot is how I would like the result to look, except that the bars are not displayed in the order of d, so they do not appear from smallest to largest:
p <- ggplot(df, aes(x=c, y=d, fill=b)) +
facet_grid(. ~ a) +
geom_bar(stat="identity")
print(p)
This is because column c is a factor whose levels are not in the same order as column d. So I did the following:
df$c <- paste(1:nrow(df), df$c, sep="_")
df$c <- factor(df$c, levels = unique(df$c))
p <- ggplot(df, aes(x=c, y=d, fill=b)) +
facet_grid(. ~ a) +
geom_bar(stat="identity")
print(p)
which produces the following plot:
Here the order is correct. However, because I created unique factor levels, I get empty spaces for the levels that are not present in A and B respectively.
How can I sort that out?
Now that you have changed the question, 'ggplot' cannot do this for you. By giving [df$c] levels, you could order the data but only based on the first set of [c] values. For instance:
df$c <- factor(df$c, levels=levels(df$c)[order(df$d)])
But that won't work, since you're trying to sort [df$c] twice (once for "A", and once for "B").
You really need to break this into two separate plots, and just plot the two viewports next to each other.
Setting up the viewports:
library(grid)
grid.newpage()
pushViewport(viewport(layout = grid.layout(1, 2)))
Plot A:
a_df <- df[df$a=="A",]
a_df$c <- factor(a_df$c, levels=as.character(a_df$c)[order(a_df$d)])
a_p <- ggplot(a_df, aes(x=1:10, y=d, fill=b)) +
facet_grid(. ~ a) +
geom_bar(stat="identity", position="dodge")
print(a_p, vp = viewport(layout.pos.row=1, layout.pos.col=1))
Plot B:
b_df <- df[df$a=="B",]
b_df$c <- factor(b_df$c, levels=as.character(b_df$c)[order(b_df$d)])
b_p <- ggplot(b_df, aes(x=1:10, y=d, fill=b)) +
facet_grid(. ~ a) +
geom_bar(stat="identity", position="dodge")
print(b_p, vp = viewport(layout.pos.row=1, layout.pos.col=2))
From here, you can worry about removing the excess legend, choosing which axes to label and such, but it looks exactly like your example plot only with the empty locations removed.
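As an alternative to separate viewports, free x scales in the facets may be enough: with scales = "free_x", each panel shows only the factor levels that occur in it. The sketch below assumes the unique-label trick from the question; the column name bar_id and the label-stripping function are made up for the example, and set.seed() is added only to make the shuffled letters reproducible.

```r
library(ggplot2)

set.seed(42)
df <- data.frame(a = rep(c("A", "B"), each = 10),
                 b = rep(rep(c("C", "D"), each = 5), 2),
                 c = c(sample(letters[1:5]), sample(letters[6:10]),
                       sample(letters[1:5]), sample(letters[6:10])),
                 d = c(0.10, 0.18, 0.34, 0.35, 0.59, 0.16, 0.38, 0.40, 0.53, 0.58,
                       0.37, 0.62, 0.83, 1.46, -0.91, -0.79, -0.52, -0.43, -0.01, 0.34))

# One unique id per bar, with levels sorted by d within each a/b group
df$bar_id <- paste(seq_len(nrow(df)), df$c, sep = "_")
df$bar_id <- factor(df$bar_id, levels = df$bar_id[order(df$a, df$b, df$d)])

p <- ggplot(df, aes(x = bar_id, y = d, fill = b)) +
  geom_col() +
  # free_x drops the levels that do not occur in a panel
  facet_grid(. ~ a, scales = "free_x") +
  # show the original letters instead of the unique ids
  scale_x_discrete(labels = function(id) sub("^[0-9]+_", "", id)) +
  xlab("c")
print(p)
```

This keeps a single plot and a single legend, at the cost of carrying an extra id column.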
This is really an example of how 'ggplot' is sometimes more of a hindrance and less of a boon. In my experience, it's best to first design your plot and then choose the tool. Frequently, I find myself going back to raw 'grid' to do my visuals, because I want something the 'grid' wrapper 'ggplot' just won't do.
Note: In the future, don't delete your original question content; just add the updated info. Removing the old content makes a lot of the answers and comments on this page irrelevant.
I think this is actually a common mistake with the 'ggplot' function. If you set an outline color (i.e. aes(colour="red")), you will see that you are actually plotting all four values, but they are plotting on top of each other. The stacking warning is because the default value of 'position' is "stack". Just include the position="dodge" argument, and that will go away.
Now, to actually solve your problem. You need to give 'ggplot' something to distinguish between the values of X(A), X(B), Y(A), and Y(B). At first glance, you might be tempted to use your [b] values, but you don't want all of the extra spaces. Let's adjust your dataframe to have only 1s and 2s for [b]:
df <- data.frame(a=rep(rep(c("A","B"),each=2),2),
b=rep(1:2,4),
c=rep(c("X","Y"),each=4),
d=c(1.2,1.1,1.15,1.1, -1.1,-1.05,-1.2,-1.08))
The plot is actually pretty easy to fix once you know the problem. First, set [b] to your x-axis, and add [a] to your facet. Then remove all of the annoying gibberish from [b] using the 'theme' with blank elements:
p <- ggplot(NULL, aes(x=b, y=d)) +
facet_grid(. ~ c + a) +
geom_bar(data = df, stat="identity", position="dodge") +
theme(axis.ticks = element_blank(), axis.text.x = element_blank(), axis.title.x = element_blank())
print(p)
If this isn't exactly what you want, it should be at least close enough that you'll only have to do cosmetic changes. Good luck!
