Ggplot does not label all interesting peptides for volcano plot - r

I have a dataframe containing 2479 peptides with their sequence, p-value and logfold change.
# A tibble: 6 x 3
Sequence p log2fold
<chr> <dbl> <dbl>
1 FLENEDR 0.343 1.21
2 DTEEEDFHVDQATTVK 0.270 0.771
3 DTEEEDFHVDQATTVK 0.112 1.18
4 SCRASQSVSSSF 0.798 0.139
5 RLSCTTSGF 0.739 0.110
6 SCRASQSVSSSY 0.209 0.375
I'm trying to make a volcano plot while labelling the up- and downregulated peptides. However, for some reason, ggplot only draws 6 labels. I have no idea why.
I have tried loads of different things. I tried using up- and downregulation in the expression column, and I tried increasing and decreasing my cut-off values to check whether this was the problem. I used ggrepel to try to spread the labels out more. Nothing seems to be working. My latest attempt is in the code below.
Basically, as a last resort I made a new group and kept only the peptides that pass both the significance and the fold-change cut-off, resulting in 39 peptides. Then I used this as header and matched peptides between the two dataframes.
Another problem is in my legend: a character has appeared in it since I started using geom_text_repel. I have no idea how or why this is happening.
library(ggplot2)
library(ggrepel)
library(tidyverse)
Volc <- R_volcano
expression <- ifelse(Volc$p < 0.05 & abs (Volc$log2fold) >=1, ifelse(Volc$log2fold>1, 'up', 'down'), 'stable')
Volc <- cbind(Volc, expression)
colnames(Volc)[1] <- 'Sequencenames'
Volc["group"] <- "NotSignificant"
Volc[which(Volc['p'] < 0.05 & abs(Volc['log2fold']) < 1 ),"group"] <- "Significant"
Volc[which(Volc['p'] > 0.05 & abs(Volc['log2fold']) > 1 ),"group"] <- "FoldChange"
Volc[which(Volc['p'] < 0.05 & abs(Volc['log2fold']) > 1 ),"group"] <- "Significant&FoldChange"
VolcFilter <- Volc %>% filter(group=="Significant&FoldChange")
p <- ggplot(data = Volc, aes(x = log2fold, y = -log10(p), colour=expression, label='Sequencenames')) +
geom_point(alpha=0.4, size=2) +
scale_color_manual(values=c("blue", "grey","red"))+
xlim(c(-4.5, 4.5)) +
geom_vline(xintercept=c(-1,1),lty=4,col="black",lwd=0.8) +
geom_hline(yintercept = 1.301,lty=4,col="black",lwd=0.8) +
geom_text_repel(data=head(VolcFilter), aes(label=Sequencenames))+
labs(x="log2(fold change)",
y="-log10 (p-value)",
title="Differential expression") +
theme_bw()+
theme(plot.title = element_text(hjust = 0.5),
legend.position="right",
legend.title = element_blank())
p
Any help is much appreciated. Fairly new to R.
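A hedged reading of the code above: head() returns only the first six rows of a data frame by default, so geom_text_repel(data=head(VolcFilter), ...) can draw at most six labels. The letter in the legend is most likely the key glyph of the text layer, which show.legend = FALSE suppresses. A minimal sketch, assuming a made-up data frame in place of R_volcano (the toy values and the name volcano are illustrative, not the asker's data):

```r
library(ggplot2)
library(ggrepel)

# Toy stand-in for the asker's data (hypothetical values).
set.seed(1)
Volc <- data.frame(
  Sequencenames = paste0("PEP", 1:200),
  p = runif(200),
  log2fold = rnorm(200, sd = 1.5)
)
Volc$expression <- ifelse(
  Volc$p < 0.05 & abs(Volc$log2fold) >= 1,
  ifelse(Volc$log2fold > 0, "up", "down"),
  "stable"
)

# Keep every peptide that passes both cut-offs; note that there is no head() here.
VolcFilter <- Volc[Volc$expression != "stable", ]

volcano <- ggplot(Volc, aes(x = log2fold, y = -log10(p), colour = expression)) +
  geom_point(alpha = 0.4, size = 2) +
  # Named values, so the colours do not depend on which levels are present.
  scale_colour_manual(values = c(down = "blue", stable = "grey", up = "red")) +
  geom_vline(xintercept = c(-1, 1), linetype = 4) +
  geom_hline(yintercept = -log10(0.05), linetype = 4) +
  geom_text_repel(
    data = VolcFilter,
    aes(label = Sequencenames),
    max.overlaps = Inf,   # ggrepel may otherwise drop labels that overlap too much
    show.legend = FALSE   # keeps the text glyph out of the legend
  ) +
  theme_bw()
volcano
```

With many labelled peptides the plot can get crowded, so lowering the text size or labelling only the strongest hits may still be needed.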

Related

stat_density2d - What does the legend mean?

I have a map done in R with stat_density2d. This is the code:
ggplot(data, aes(x=Lon, y=Lat)) +
stat_density2d(aes(fill = ..level..), alpha=0.5, geom="polygon",show.legend=FALSE)+
geom_point(colour="red")+
geom_path(data=map.df,aes(x=long, y=lat, group=group), colour="grey50")+
scale_fill_gradientn(colours=rev(brewer.pal(7,"Spectral")))+
xlim(-10,+2.5) +
ylim(+47,+60) +
coord_fixed(1.7) +
theme_void()
And it produces this:
Great. It works. However I do not know what the legend means. I did find this wikipedia page:
https://en.wikipedia.org/wiki/Multivariate_kernel_density_estimation
And the example they used (which contains red, orange and yellow) stated:
The coloured contours correspond to the smallest region which contains
the respective probability mass: red = 25%, orange + red = 50%, yellow
+ orange + red = 75%
However, using stat_density2d, I have 11 contours in my map. Does anyone know how stat_density2d works and what the legend means? Ideally I wanted to be able to state something like the red contour contains 25% of the plots etc.
I have read this: https://ggplot2.tidyverse.org/reference/geom_density_2d.html and I am still none the wiser.
Let's take the faithful example from ggplot2:
ggplot(faithful, aes(x = eruptions, y = waiting)) +
stat_density_2d(aes(fill = factor(after_stat(level))), geom = "polygon") +
geom_point() +
xlim(0.5, 6) +
ylim(40, 110)
(apologies in advance for not making this prettier)
The level is the height at which the 3D "mountains" were sliced. I don't know of a way (others might) to translate that to a percentage, but I do know how to get you said percentages.
If we look at that chart, level 0.002 contains the vast majority of the points (all but 2). Level 0.004 is actually 2 polygons, and together they contain all but about a dozen of the points. If I'm getting the gist of what you're asking, that's what you want to know, except not the count but the percentage of points encompassed by the polygons at a given level. That's straightforward to compute using the methodology from the various ggplot2 "stats" involved.
Note that while we're importing the tidyverse and sp packages we'll use some other functions fully-qualified. Now, let's reshape the faithful data a bit:
library(tidyverse)
library(sp)
xdf <- select(faithful, x = eruptions, y = waiting)
(easier to type x and y)
Now, we'll compute the two-dimensional kernel density estimation the way ggplot2 does:
h <- c(MASS::bandwidth.nrd(xdf$x), MASS::bandwidth.nrd(xdf$y))
dens <- MASS::kde2d(
xdf$x, xdf$y, h = h, n = 100,
lims = c(0.5, 6, 40, 110)
)
zdf <- data.frame(expand.grid(x = dens$x, y = dens$y), z = as.vector(dens$z))
breaks <- pretty(range(zdf$z), 10)
z <- tapply(zdf$z, zdf[c("x", "y")], identity)
cl <- grDevices::contourLines(
x = sort(unique(dens$x)), y = sort(unique(dens$y)), z = dens$z,
levels = breaks
)
I won't clutter the answer with str() output but it's kinda fun looking at what happens there.
We can use spatial ops to figure out how many points fall within given polygons, then we can group the polygons at the same level to provide counts and percentages per-level:
SpatialPolygons(
lapply(1:length(cl), function(idx) {
Polygons(
srl = list(Polygon(
matrix(c(cl[[idx]]$x, cl[[idx]]$y), nrow=length(cl[[idx]]$x), byrow=FALSE)
)),
ID = idx
)
})
) -> cont
coordinates(xdf) <- ~x+y
tibble(
ct = sapply(over(cont, geometry(xdf), returnList = TRUE), length),
id = 1:length(ct),
lvl = sapply(cl, function(x) x$level)
) %>%
count(lvl, wt=ct) %>%
mutate(
pct = n/length(xdf),
pct_lab = sprintf("%s of the points fall within this level", scales::percent(pct))
)
## # A tibble: 12 x 4
## lvl n pct pct_lab
## <dbl> <int> <dbl> <chr>
## 1 0.002 270 0.993 99.3% of the points fall within this level
## 2 0.004 259 0.952 95.2% of the points fall within this level
## 3 0.006 249 0.915 91.5% of the points fall within this level
## 4 0.008 232 0.853 85.3% of the points fall within this level
## 5 0.01 206 0.757 75.7% of the points fall within this level
## 6 0.012 175 0.643 64.3% of the points fall within this level
## 7 0.014 145 0.533 53.3% of the points fall within this level
## 8 0.016 94 0.346 34.6% of the points fall within this level
## 9 0.018 81 0.298 29.8% of the points fall within this level
## 10 0.02 60 0.221 22.1% of the points fall within this level
## 11 0.022 43 0.158 15.8% of the points fall within this level
## 12 0.024 13 0.0478 4.8% of the points fall within this level
I only spelled it out to avoid blathering more but the percentages will change depending on how you modify the various parameters to the density computation (same holds true for my ggalt::geom_bkde2d() which uses a different estimator).
If there is a way to tease out the percentages without re-performing the calculations there's no better way to have that pointed out than by letting other SO R folks show how much more clever they are than the person writing this answer (hopefully in more diplomatic ways than seem to be the mode of late).
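For contours that enclose a chosen share of the estimated probability mass (rather than of the observed points), one common approach is to sort the grid densities and accumulate them until the target share is reached. A minimal sketch, assuming the same kde2d grid as above; the helper name mass_level is made up for illustration:

```r
# Density on a regular grid, with the same limits as in the answer above.
dens <- MASS::kde2d(
  faithful$eruptions, faithful$waiting,
  n = 100, lims = c(0.5, 6, 40, 110)
)

# Area of one grid cell, so that sum(z) * cell_area approximates the total mass.
cell_area <- diff(dens$x[1:2]) * diff(dens$y[1:2])

# Hypothetical helper: the density level whose contour encloses roughly
# `prob` of the estimated probability mass.
mass_level <- function(dens, prob, cell_area) {
  z <- sort(as.vector(dens$z), decreasing = TRUE)
  cum_mass <- cumsum(z) * cell_area
  z[which(cum_mass >= prob)[1]]
}

probs <- c(0.25, 0.50, 0.75)
levels_for_probs <- sapply(probs, mass_level, dens = dens, cell_area = cell_area)
levels_for_probs
```

The resulting levels are approximations on a 100 x 100 grid. They describe the estimated density, so the share of actual points inside each contour can differ somewhat.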

geom_point points manual scaling

I got some data (named result.df) which looks like the following:
orgaName abundance pVal score
A 3 9.998622e-01 1.795338e-04
B 2 9.999790e-01 1.823428e-05
C 1 2.225074e-308 3.076527e+02
D 1 3.510957e-01 4.545745e-01
and so on...
What I am now plotting is this:
p1 <- ggplot(result.df, aes(log2(abundance), (1-pVal), label=orgaName)) +
ylab("1 - P-Value")+
xlab("log2(abundance)")+
geom_point(aes(size=score))+
ggtitle(colnames(case.count.matrix)[i])+
geom_text(data=subset(result.df, pVal < 0.05),hjust=.65, vjust=-1.2,size=2.5)+
geom_hline(aes(yintercept=.95), colour="blue", linetype="dashed")+
theme_classic()
Everything works fine and looks rather fine. However, what I would like is to scale the point size introduced through
geom_point(aes(size=score))+
to be scaled against fixed values. So the legend should scale in a decadic logarithm but the score should stay the same. Such that low scores nearly disappear and large scores are kind of comparable with respect to their point size between different "result.df".
EDIT
After checking the comments of @roman and @vrajs5, I was able to produce a plot using the following code:
ggplot(result.df, aes(log2(abundance), (1-pVal), label=orgaName)) +
ylab("1 - P-Value")+
xlab("log2(abundance)")+
geom_point(aes(size=score))+
ggtitle(colnames(case.count.matrix)[i])+
#geom_text(data=subset(result.df, pVal < 0.05 & log2(abundance) > xInt),hjust=.65, vjust=-1.2,size=2.5)+
geom_text(data=subset(result.df, pVal < 0.05),hjust=.65, vjust=-1.2,size=2.5)+
geom_hline(aes(yintercept=.95), colour="blue", linetype="dashed")+
#geom_vline(aes(xintercept=xInt), colour="blue", linetype="dashed")+
#geom_text(data=subset(result.df, pVal > 0.05 & log2(abundance) > xInt),alpha=.5,hjust=.65, vjust=-1.2,size=2)+
#geom_text(data=subset(result.df, pVal < 0.05 & log2(abundance) < xInt),alpha=.5,hjust=.65, vjust=-1.2,size=2)+
theme_classic() +
scale_size(range=c(2,12),expand=c(2,0),breaks=c(0,1,10,100,1000,1000000),labels=c(">=0",">=1",">=10",">=100",">=1000",">=1000000"),guide="legend")
As you can see, the breaks are introduced and labeled as intended. However, the point size in the legend does not reflect the point sizes in the plot. Any idea how to fix this?
As @Roman mentioned, if you use scale_size you can specify the limits on size.
The following example shows how to control the size of the points:
result.df = read.table(text = 'orgaName abundance pVal score
A 3 9.998622e-01 1.795338e-04
B 2 9.999790e-01 1.823428e-05
C 1 2.225074e-308 3.076527e+02
D 1 3.510957e-01 4.545745e-01
E 3 2.510957e-01 2.545745e+00
F 3 1.510957e-02 2.006527e+02
G 2 5.510957e-01 3.545745e-02', header = T)
library(ggplot2)
ggplot(result.df, aes(log2(abundance), (1-pVal), label=orgaName)) +
ylab("1 - P-Value")+
xlab("log2(abundance)")+
geom_point(aes(size=score))+
#ggtitle(colnames(case.count.matrix)[i])+
geom_text(data=subset(result.df, pVal < 0.05),hjust=.65, vjust=-1.2,size=2.5)+
geom_hline(aes(yintercept=.95), colour="blue", linetype="dashed")+
theme_classic() +
scale_size(range = c(2,12))
Output graph is
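To keep point sizes comparable between different result.df objects, one option is to map the size to log10(score) and fix the scale limits, so that the same score always gets the same size. A sketch under assumptions: the limits of -5 to 6 (in log10 units) and the break positions are made-up values, to be replaced with bounds that cover all data sets:

```r
library(ggplot2)

result.df <- read.table(text = "orgaName abundance pVal score
A 3 9.998622e-01 1.795338e-04
B 2 9.999790e-01 1.823428e-05
C 1 2.225074e-308 3.076527e+02
D 1 3.510957e-01 4.545745e-01", header = TRUE)

# Breaks in log10 units, labelled with the scores they stand for.
size_breaks <- c(-3, 0, 1, 2, 3)

p_fixed <- ggplot(result.df, aes(log2(abundance), 1 - pVal)) +
  geom_point(aes(size = log10(score))) +
  scale_size(
    name = "score",
    range = c(0.5, 12),     # smallest and largest point size
    limits = c(-5, 6),      # assumed fixed bounds, shared by all plots
    breaks = size_breaks,
    labels = as.character(10^size_breaks)
  ) +
  theme_classic()
p_fixed
```

Scores outside the fixed limits would be dropped from the plot, so the bounds need to be chosen generously.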

Colour regulated genes in Volcano plot - ggplot2

Hi, I'm very new to R and I'm struggling to modify some R code that I found on the internet while learning how to make a volcano plot.
This code makes volcano plots using ggplot2. The problem is that I want to colour the up- and down-regulated proteins instead of colouring the proteins above the specified threshold. The code I'm using is the following:
install.packages("ggplot2")
gene_list <- read.table("/Users/Javi/Desktop/gene_list.csv", header=T, sep=",")
require(ggplot2)
##Highlight genes that have an absolute fold change > 2 and a p-value < 0.05
gene_list$threshold = as.factor(abs(gene_list$logFC) > 2 & gene_list$P.Value < 0.05)
##Construct the plot object
g = ggplot(data=gene_list, aes(x=logFC, y=-log10(P.Value), colour=my_palette)) +
geom_point(alpha=0.4, size=5) +
theme(legend.position = "none") +
xlim(c(-10, 10)) + ylim(c(0, 15)) +
xlab("log2 fold change") + ylab("-log10 p-value")
g
What I would like to do is to colour in red (for example) the logFC values > 1.3 and in blue the logFC values < -1.3
The csv file I'm using is just an example and would be something like this:
logFC P.Value
a 2 0.04
b 5 0.04
c 8 0.04
d 4 0.000005
e 7 0.01
f 1 0.04
g -6 0.0001
h -8 0.04
Thanks very much for your help in advance.
Cheers
Javi
Create a new color flag on your dataframe:
gene_list$color_flag <- ifelse(gene_list$logFC > 1.3, 1, ifelse(gene_list$logFC < -1.3, -1, 0))
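Then map it in your aes with colour = factor(color_flag). Points drawn with the default shape ignore fill, so colour is the aesthetic to use here.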
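A minimal sketch of the flag approach, mapped to colour and using the example values from the question. The column name regulation and the grey colour for unchanged genes are assumptions made for illustration:

```r
library(ggplot2)

# Example data from the question.
gene_list <- data.frame(
  logFC   = c(2, 5, 8, 4, 7, 1, -6, -8),
  P.Value = c(0.04, 0.04, 0.04, 0.000005, 0.01, 0.04, 0.0001, 0.04),
  row.names = letters[1:8]
)

# Label each gene by the direction of its fold change.
gene_list$regulation <- ifelse(
  gene_list$logFC > 1.3, "up",
  ifelse(gene_list$logFC < -1.3, "down", "unchanged")
)

g <- ggplot(gene_list, aes(x = logFC, y = -log10(P.Value), colour = regulation)) +
  geom_point(alpha = 0.4, size = 5) +
  # Named values tie each label to a fixed colour.
  scale_colour_manual(values = c(up = "red", down = "blue", unchanged = "grey50")) +
  xlim(c(-10, 10)) + ylim(c(0, 15)) +
  xlab("log2 fold change") + ylab("-log10 p-value") +
  theme(legend.position = "none")
g
```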

How to dodge points in ggplot2 in R

df = data.frame(subj=c(1,1,2,2,3,3,4,4,5,5,6,6,7,7,8,8,9,9,10,10), block=factor(rep(c(1,2),10)), acc=c(0.75,0.83,0.58,0.75,0.58,0.83,0.92,0.83,0.83,0.67,0.75,0.5,0.67,0.83,0.92,0.58,0.75,0.5,0.67,0.67))
ggplot(df,aes(block,acc,group=subj)) + geom_point(position=position_dodge(width=0.3)) + ylim(0,1) + labs(x='Block',y='Accuracy')
How do I get points to dodge each other uniformly in the horizontal direction? (I grouped by subj in order to get it to dodge at all, which might not be the correct thing to do...)
I think this might be what you were looking for, although no doubt you have solved it by now.
Hopefully it will help someone else with the same issue.
A simple way is to use geom_dotplot like this:
ggplot(df,aes(x=block,y=acc)) +
geom_dotplot(binaxis = "y", stackdir = "center", binwidth = 0.03) + ylim(0,1) + labs(x='Block',y='Accuracy')
This looks like this:
Note that x (block in this case) has to be a factor for this to work.
If they don't have to be perfectly aligned horizontally, here's one quick way of doing it, using geom_jitter. You don't need to group by subj.
Method 1 [Simpler]: Using geom_jitter()
ggplot(df,aes(x=block,y=acc)) + geom_jitter(position=position_jitter(0.05)) + ylim(0,1) + labs(x='Block',y='Accuracy')
Play with the jitter width for greater degree of jittering.
which produces:
Method 2: Deterministically calculating the jitter value for each row
We first use aggregate to count the number of duplicated entries. Then in a new data frame, for each duplicated value, move it horizontally to the left by an epsilon distance.
df$subj <- NULL #drop this so that aggregate works.
#a new data frame that shows duplicated values
agg.df <- aggregate(list(numdup=seq_len(nrow(df))), df, length)
agg.df$block <- as.numeric(agg.df$block) #convert block from factor to numeric
# block acc numdup
#1 2 0.50 2
#2 1 0.58 2
#3 2 0.58 1
#4 1 0.67 2
#...
epsilon <- 0.02 #jitter distance
new.df <- NULL #create an expanded dataframe, with block value jittered deterministically
r <- 0
for (i in 1:nrow(agg.df)) {
for (j in 1:agg.df$numdup[i]) {
r <- r+1 #row counter in the expanded df
new.df$block[r] <- agg.df$block[i]
new.df$acc[r] <- agg.df$acc[i]
new.df$jit.value[r] <- agg.df$block[i] - (j-1)*epsilon
}
}
new.df <- as.data.frame(new.df)
ggplot(new.df,aes(x=jit.value,y=acc)) + geom_point(size=2) + ylim(0,1) + labs(x='Block',y='Accuracy') + xlim(0,3)
which produces:
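The deterministic offsets of Method 2 can also be computed without the explicit loop. A sketch using base R's ave(), assuming the same data and epsilon as above; here duplicates are spread symmetrically around the block position rather than shifted to the left:

```r
library(ggplot2)

df <- data.frame(
  block = factor(rep(c(1, 2), 10)),
  acc = c(0.75, 0.83, 0.58, 0.75, 0.58, 0.83, 0.92, 0.83, 0.83, 0.67,
          0.75, 0.5, 0.67, 0.83, 0.92, 0.58, 0.75, 0.5, 0.67, 0.67)
)
epsilon <- 0.02 # horizontal distance between neighbouring duplicates

# Within each (block, acc) group, spread the duplicates symmetrically around 0.
df$offset <- ave(
  as.numeric(seq_len(nrow(df))), df$block, df$acc,
  FUN = function(i) (seq_along(i) - (length(i) + 1) / 2) * epsilon
)
df$jit.value <- as.numeric(df$block) + df$offset

ggplot(df, aes(x = jit.value, y = acc)) +
  geom_point(size = 2) +
  scale_x_continuous(breaks = 1:2, labels = levels(df$block), limits = c(0, 3)) +
  ylim(0, 1) +
  labs(x = "Block", y = "Accuracy")
```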

Troubles with ggplot and geom_bar

Here the updated example:
df <- data.frame(a=rep(c("A","B"),each=10),
b=rep(rep(c("C","D"),each=5),2),
c=c(sample(letters[1:5]), sample(letters[6:10]),
sample(letters[1:5]), sample(letters[6:10])),
d=c(0.10,0.18,0.34,0.35,0.59,0.16,0.38,0.40,0.53,0.58,
0.37,0.62,0.83,1.46,-0.91,-0.79,-0.52,-0.43,-0.01,0.34))
> df
a b c d
1 A C b 0.10
2 A C e 0.18
3 A C a 0.34
4 A C c 0.35
5 A C d 0.59
6 A D i 0.16
7 A D j 0.38
8 A D h 0.40
9 A D f 0.53
10 A D g 0.58
11 B C e 0.37
12 B C d 0.62
13 B C a 0.83
14 B C c 1.46
15 B C b -0.91
16 B D f -0.79
17 B D i -0.52
18 B D h -0.43
19 B D j -0.01
20 B D g 0.34
If you look closely, you will see that column d is ordered within column b always from smallest to largest.
The first plot is how I would like to have the plot apart from the fact, that the bars displayed are not in the order of d. So the bars do not appear from smallest to largest:
p <- ggplot(df, aes(x=c, y=d, fill=b, stat="identity")) +
facet_grid(. ~ a) +
geom_bar()
print(p)
This is because column c is a factor and the factors are apparently not ordered in the same order as column d is. So I did the following:
df$c <- paste(1:nrow(df), df$c, sep="_")
df$c <- factor(df$c, levels = unfactor(df$c))
p <- ggplot(df, aes(x=c, y=d, fill=b, stat="identity")) +
facet_grid(. ~ a) +
geom_bar()
print(p)
produces the following plot:
Here the order is correct. However, because I created unique factor levels, I get empty slots for the levels that are not present in A and B respectively.
How can I sort that out?
Now that you have changed the question, 'ggplot' cannot do this for you. By giving [df$c] levels, you could order the data but only based on the first set of [c] values. For instance:
df$c <- factor(df$c, levels=levels(df$c)[order(df$d)])
But that won't work, since you're trying to sort [df$c] twice (once for "A", and once for "B").
You really need to break this into two separate plots, and just plot the two viewports next to each other.
Setting up the viewports:
grid.newpage()
pushViewport(viewport(layout = grid.layout(1, 2)))
Plot A:
a_df <- df[df$a=="A",]
a_df$c <- factor(a_df$c, levels=levels(a_df$c)[order(a_df$d)])
a_p <- ggplot(a_df, aes(x=1:10, y=d, fill=b)) +
facet_grid(. ~ a) +
geom_bar(stat="identity", position="dodge")
print(a_p, vp = viewport(layout.pos.row=1, layout.pos.col=1))
Plot B:
b_df <- df[df$a=="B",]
b_df$c <- factor(b_df$c, levels=levels(b_df$c)[order(b_df$d)])
b_p <- ggplot(b_df, aes(x=1:10, y=d, fill=b)) +
facet_grid(. ~ a) +
geom_bar(stat="identity", position="dodge")
print(b_p, vp = viewport(layout.pos.row=1, layout.pos.col=2))
From here, you can worry about removing the excess legend, choosing which axes to label and such, but it looks exactly like your example plot only with the empty locations removed.
This is really an example of how 'ggplot' is sometimes more of a hindrance and less of a boon. In my experience, it's best to first design your plot and then choose the tool. Frequently, I find myself going back to raw 'grid' to do my visuals, because I want something the 'grid' wrapper 'ggplot' just won't do.
Note: In the future, don't delete your original question content; just add the updated info. Removing the old content makes a lot of the answers and comments on this page irrelevant.
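As a possible alternative to separate viewports, more recent ggplot2 versions let each facet drop the x levels it does not contain via free scales. A hedged sketch, assuming one unique key per row (the column name key and the seed are made up for illustration) and relabelling the axis with the original letters:

```r
library(ggplot2)

set.seed(42) # only so that the sampled letters are reproducible
df <- data.frame(
  a = rep(c("A", "B"), each = 10),
  b = rep(rep(c("C", "D"), each = 5), 2),
  c = c(sample(letters[1:5]), sample(letters[6:10]),
        sample(letters[1:5]), sample(letters[6:10])),
  d = c(0.10, 0.18, 0.34, 0.35, 0.59, 0.16, 0.38, 0.40, 0.53, 0.58,
        0.37, 0.62, 0.83, 1.46, -0.91, -0.79, -0.52, -0.43, -0.01, 0.34)
)

# One unique x key per row, with levels in row order.
df$key <- factor(seq_len(nrow(df)))

p_free <- ggplot(df, aes(x = key, y = d, fill = b)) +
  geom_col() +
  # Show the original letters instead of the row keys.
  scale_x_discrete(labels = setNames(df$c, as.character(df$key))) +
  # free_x lets each facet keep only the keys that occur in it.
  facet_grid(. ~ a, scales = "free_x", space = "free_x")
p_free
```

The bars then follow the row order of the data frame, so sorting df beforehand controls the order within each facet.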
I think this is actually a common mistake with the 'ggplot' function. If you set an outline color (i.e. aes(colour="red")), you will see that you are actually plotting all four values, but they are plotting on top of each other. The stacking warning is because the default value of 'position' is "stack". Just include the position="dodge" argument, and that will go away.
Now, to actually solve your problem. You need to give 'ggplot' something to distinguish between the values of X(A), X(B), Y(A), and Y(B). At first glance, you might be tempted to use your [b] values, but you don't want all of the extra spaces. Let's adjust your dataframe to have only 1s and 2s for [b]:
df <- data.frame(a=rep(rep(c("A","B"),each=2),2),
b=rep(1:2,4),
c=rep(c("X","Y"),each=4),
d=c(1.2,1.1,1.15,1.1, -1.1,-1.05,-1.2,-1.08))
The plot is actually pretty easy to fix once you know the problem. First, set [b] to your x-axis, and add [a] to your facet. Then remove all of the annoying gibberish from [b] using the 'theme' with blank elements:
p <- ggplot(NULL, aes(x=b, y=d)) +
facet_grid(. ~ c + a) +
geom_bar(data = df, stat="identity", position="dodge") +
theme(axis.ticks = element_blank(), axis.text.x = element_blank(), axis.title.x = element_blank())
print(p)
If this isn't exactly what you want, it should be at least close enough that you'll only have to do cosmetic changes. Good luck!
