I have some data (named result.df) that looks like this:
orgaName abundance pVal score
A 3 9.998622e-01 1.795338e-04
B 2 9.999790e-01 1.823428e-05
C 1 2.225074e-308 3.076527e+02
D 1 3.510957e-01 4.545745e-01
and so on...
What I am now plotting is this:
p1 <- ggplot(result.df, aes(log2(abundance), (1-pVal), label=orgaName)) +
ylab("1 - P-Value")+
xlab("log2(abundance)")+
geom_point(aes(size=score))+
ggtitle(colnames(case.count.matrix)[i])+
geom_text(data=subset(result.df, pVal < 0.05),hjust=.65, vjust=-1.2,size=2.5)+
geom_hline(aes(yintercept=.95), colour="blue", linetype="dashed")+
theme_classic()
Everything works and looks fine. However, I would like the point size introduced through
geom_point(aes(size=score))+
to be scaled against fixed values. The legend should use a base-10 logarithmic scale while the scores themselves stay unchanged, so that low scores nearly disappear and large scores get comparable point sizes across different result.df data frames.
EDIT
After checking the comments of #roman and #vrajs5, I was able to produce a plot using the following code:
ggplot(result.df, aes(log2(abundance), (1-pVal), label=orgaName)) +
ylab("1 - P-Value")+
xlab("log2(abundance)")+
geom_point(aes(size=score))+
ggtitle(colnames(case.count.matrix)[i])+
#geom_text(data=subset(result.df, pVal < 0.05 & log2(abundance) > xInt),hjust=.65, vjust=-1.2,size=2.5)+
geom_text(data=subset(result.df, pVal < 0.05),hjust=.65, vjust=-1.2,size=2.5)+
geom_hline(aes(yintercept=.95), colour="blue", linetype="dashed")+
#geom_vline(aes(xintercept=xInt), colour="blue", linetype="dashed")+
#geom_text(data=subset(result.df, pVal > 0.05 & log2(abundance) > xInt),alpha=.5,hjust=.65, vjust=-1.2,size=2)+
#geom_text(data=subset(result.df, pVal < 0.05 & log2(abundance) < xInt),alpha=.5,hjust=.65, vjust=-1.2,size=2)+
theme_classic() +
scale_size(range=c(2,12),expand=c(2,0),breaks=c(0,1,10,100,1000,1000000),labels=c(">=0",">=1",">=10",">=100",">=1000",">=1000000"),guide="legend")
As you can see, the breaks are introduced and labeled as intended. However, the point sizes in the legend do not reflect the point sizes in the plot. Any idea how to fix this?
As #Roman mentioned, with scale_size you can specify the limits of the point size through its range argument.
The following example shows how to control the size of the points:
result.df = read.table(text = 'orgaName abundance pVal score
A 3 9.998622e-01 1.795338e-04
B 2 9.999790e-01 1.823428e-05
C 1 2.225074e-308 3.076527e+02
D 1 3.510957e-01 4.545745e-01
E 3 2.510957e-01 2.545745e+00
F 3 1.510957e-02 2.006527e+02
G 2 5.510957e-01 3.545745e-02', header = T)
library(ggplot2)
ggplot(result.df, aes(log2(abundance), (1-pVal), label=orgaName)) +
ylab("1 - P-Value")+
xlab("log2(abundance)")+
geom_point(aes(size=score))+
#ggtitle(colnames(case.count.matrix)[i])+
geom_text(data=subset(result.df, pVal < 0.05),hjust=.65, vjust=-1.2,size=2.5)+
geom_hline(aes(yintercept=.95), colour="blue", linetype="dashed")+
theme_classic() +
scale_size(range = c(2,12))
Output graph is
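To keep point sizes comparable between different result.df data frames, one option is to fix the limits and breaks of the size scale instead of letting each plot derive them from its own data. The sketch below maps log10(score) to size. The values in size_limits and size_breaks are assumptions picked for illustration, not values from the question.

```r
library(ggplot2)

result.df <- read.table(text = "orgaName abundance pVal score
A 3 9.998622e-01 1.795338e-04
B 2 9.999790e-01 1.823428e-05
C 1 2.225074e-308 3.076527e+02
D 1 3.510957e-01 4.545745e-01
E 3 2.510957e-01 2.545745e+00
F 3 1.510957e-02 2.006527e+02
G 2 5.510957e-01 3.545745e-02", header = TRUE)

# Assumed, hand-picked values: reuse the same two vectors for every plot
size_limits <- c(1e-5, 1e3)
size_breaks <- c(1e-4, 1e-2, 1, 1e2)

p <- ggplot(result.df, aes(log2(abundance), 1 - pVal)) +
  # Size follows the decadic logarithm of the score
  geom_point(aes(size = log10(score))) +
  scale_size(name = "score",
             range = c(0.5, 12),              # smallest and largest point size
             limits = log10(size_limits),     # identical in every plot
             breaks = log10(size_breaks),
             labels = c("1e-04", "0.01", "1", "100")) +
  theme_classic()
print(p)
```

Scores outside size_limits are dropped with a warning, so the limits need to cover every data set you want to compare.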
Related
I have a dataframe containing 2479 peptides with their sequence, p-value and logfold change.
# A tibble: 6 x 3
Sequence p log2fold
<chr> <dbl> <dbl>
1 FLENEDR 0.343 1.21
2 DTEEEDFHVDQATTVK 0.270 0.771
3 DTEEEDFHVDQATTVK 0.112 1.18
4 SCRASQSVSSSF 0.798 0.139
5 RLSCTTSGF 0.739 0.110
6 SCRASQSVSSSY 0.209 0.375
I'm trying to make a volcano plot while labelling the up and downregulated peptides. However, for some reason, ggplot only uses 6 labels. I have no idea why.
I have tried lots of different things: using up- and downregulation in the expression column, raising and lowering my cut-off values to check whether they were the problem, and using ggrepel to spread the labels out. Nothing seems to work. My latest attempt is in the code below.
As a last resort, I made a new group containing only the peptides that pass both the significance and the fold-change cut-off, which leaves 39 peptides. Then I used this as header and matched peptides between the two dataframes.
Another problem is in my legend: a character appears in the legend keys since I started using geom_text_repel. I have no idea how or why this is happening.
library(ggplot2)
library(ggrepel)
library(tidyverse)
Volc <- R_volcano
expression <- ifelse(Volc$p < 0.05 & abs (Volc$log2fold) >=1, ifelse(Volc$log2fold>1, 'up', 'down'), 'stable')
Volc <- cbind(Volc, expression)
colnames(Volc)[1] <- 'Sequencenames'
Volc["group"] <- "NotSignificant"
Volc[which(Volc['p'] < 0.05 & abs(Volc['log2fold']) < 1 ),"group"] <- "Significant"
Volc[which(Volc['p'] > 0.05 & abs(Volc['log2fold']) > 1 ),"group"] <- "FoldChange"
Volc[which(Volc['p'] < 0.05 & abs(Volc['log2fold']) > 1 ),"group"] <- "Significant&FoldChange"
VolcFilter <- Volc %>% filter(group=="Significant&FoldChange")
p <- ggplot(data = Volc, aes(x = log2fold, y = -log10(p), colour=expression, label='Sequencenames')) +
geom_point(alpha=0.4, size=2) +
scale_color_manual(values=c("blue", "grey","red"))+
xlim(c(-4.5, 4.5)) +
geom_vline(xintercept=c(-1,1),lty=4,col="black",lwd=0.8) +
geom_hline(yintercept = 1.301,lty=4,col="black",lwd=0.8) +
geom_text_repel(data=head(VolcFilter), aes(label=Sequencenames))+
labs(x="log2(fold change)",
y="-log10 (p-value)",
title="Differential expression") +
theme_bw()+
theme(plot.title = element_text(hjust = 0.5),
legend.position="right",
legend.title = element_blank())
p
Any help is much appreciated. Fairly new to R.
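Two details in the code above may explain both symptoms. head() returns only the first six rows of a data frame by default, and a text layer that inherits the colour aesthetic draws its own "a" glyph in the legend unless show.legend = FALSE is set. Here is a minimal sketch with made-up data: the Volc values are hypothetical, and max.overlaps assumes ggrepel 0.9.0 or later.

```r
library(ggplot2)
library(ggrepel)

# Made-up stand-in for R_volcano: 20 peptides, 10 of which pass both cut-offs
i <- 1:20
Volc <- data.frame(
  Sequencenames = paste0("PEP", i),
  p = rep(c(0.01, 0.01, 0.01, 0.3), 5) / i,
  log2fold = rep(c(-2, 2, 0.5, -0.5), 5) + i / 100
)
Volc$expression <- ifelse(Volc$p < 0.05 & abs(Volc$log2fold) >= 1,
                          ifelse(Volc$log2fold > 0, "up", "down"),
                          "stable")
VolcFilter <- subset(Volc, expression != "stable")

print(nrow(VolcFilter))        # every peptide that passes both cut-offs
print(nrow(head(VolcFilter)))  # head() keeps at most six of them

volcano <- ggplot(Volc, aes(log2fold, -log10(p), colour = expression)) +
  geom_point(alpha = 0.4, size = 2) +
  scale_color_manual(values = c(down = "blue", stable = "grey", up = "red")) +
  # Label every filtered row and keep the text glyph out of the legend
  geom_text_repel(data = VolcFilter, aes(label = Sequencenames),
                  show.legend = FALSE, max.overlaps = Inf) +
  theme_bw()
print(volcano)
```

With thousands of points, ggrepel may still skip labels that overlap too much, which is why max.overlaps is raised here.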
I have a data set whose X values are integers from 1 to several thousand, and I want to plot the mean Y and a measure of dispersion around that mean. The problem is that some X values are missing. When I use geom_line and geom_ribbon, the plot is drawn continuously across the gaps, and I cannot find a way to leave blanks where there is no data.
Here is a mock-up reproducible example.
data.1 <-read.csv(text = "
Treatment, X, Y_value
A,1,120.5
B,1,123.6
C,1,100.4
A,2,120.9
B,2,123.9
C,2,101.0
A,3,122.3
B,3,126.6
C,3,102.3
A,6,124.8
B,6,128.0
C,6,105.5
A,7,129.5
B,7,129.4
C,7,108.9
A,8,132.9
B,8,130.6
C,8,113.9
A,9,137.6
B,9,136.0
C,9,115.3
A,10,138.4
B,10,139.6
C,10,118.9
A,11,143.9
B,11,145.9
C,11,126.6
")
library(dplyr) # provides %>%, group_by() and summarise()
data.1 <- data.1 %>% group_by(X) %>% summarise(mean.y = mean(Y_value),
sd.y = sd(Y_value))
library(ggplot2)
ggplot(data.1, aes(X, mean.y)) +
geom_line(color="red") +
geom_ribbon(aes(ymin=mean.y-sd.y, ymax=mean.y+sd.y), alpha=0.4) +
scale_x_continuous(limits=c(0,11), breaks = c(seq(min(0),max(11), length.out = 12)))+
theme_bw() +
theme(panel.grid.minor = element_blank(),
panel.grid.major = element_blank())
Here is the output I am getting:
And this is what I would like to get:
Any hint on how to accomplish this would be really appreciated.
Thanks
You can add a grouping column to mark X values above and below the cutoff. In this case, I've hard-coded the criterion, but in general you can do it programmatically if you have criteria for where the discontinuities should be.
For example:
ggplot(data.1, aes(X, mean.y, group=X<5)) +
geom_line(color="red") +
geom_ribbon(aes(ymin=mean.y-sd.y, ymax=mean.y+sd.y), alpha=0.4) +
scale_x_continuous(limits=c(0,11), breaks = 0:12) +
theme_bw() +
theme(panel.grid.minor = element_blank(),
panel.grid.major = element_blank())
Or, if our criterion is to have a discontinuity whenever the distance between x-values is greater than one:
data.1 %>%
mutate(g = c(0, cumsum(diff(X) > 1))) %>%
ggplot(aes(X, mean.y, group=g)) +
geom_line(color="red") +
geom_ribbon(aes(ymin=mean.y-sd.y, ymax=mean.y+sd.y), alpha=0.4) +
scale_x_continuous(limits=c(0,11), breaks = 0:12) +
theme_bw() +
theme(panel.grid.minor = element_blank(),
panel.grid.major = element_blank())
Either way, here's the resulting plot:
Here's some additional explanation to answer the question in the comment regarding how the mutate step creates the grouping column: We want to create a grouping variable that separates X values before and after a discontinuity. In the code above, we do that with a combination of the diff and cumsum functions.
diff calculates lagged differences. For example:
diff(data.1$X)
[1] 1 1 3 1 1 1 1 1
Note that one of the differences (the one between 3 and 6) is 3. Now let's add a logical condition:
diff(data.1$X) > 1
[1] FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
So now we have a vector of logical values where TRUE marks differences greater than one. cumsum will treat TRUE as equal to 1 and FALSE as equal to zero. The value of the cumulative sum will increment by one each time we encounter a TRUE, and will stay constant when we encounter a FALSE.
cumsum(diff(data.1$X) > 1)
[1] 0 0 1 1 1 1 1 1
Okay, now we have two groups, marking the X values before and after the discontinuity (if there are multiple discontinuities, we'll get a new group for each one). But we're not quite done.
Note that diff takes a vector of length n and returns a vector of length n-1. This is simply because there are only n-1 lagged differences between n values. Thus, we add a leading zero to get a vector that's the same length as the input data:
c(0, cumsum(diff(data.1$X) > 1))
[1] 0 0 0 1 1 1 1 1 1
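The same steps carry over to data with several gaps. As a small sketch, the helper gap_groups() below (a hypothetical name, not part of any package) wraps the diff/cumsum idea and is applied to a made-up vector with two discontinuities:

```r
# Hypothetical helper: give each run of consecutive x values its own group id
gap_groups <- function(x, max_step = 1) {
  # diff() finds the step sizes, cumsum() counts the gaps seen so far,
  # and the leading 0 restores the original length
  c(0, cumsum(diff(x) > max_step))
}

x <- c(1, 2, 3, 6, 7, 10, 11)  # made-up values with gaps after 3 and after 7
g <- gap_groups(x)
print(g)
# [1] 0 0 0 1 1 2 2

# Each group is one uninterrupted stretch of the line and ribbon
print(split(x, g))
```

Passing a larger max_step would tolerate small gaps and only break the line at larger ones.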
Hi, I'm very new to R and I'm struggling to modify some R code that I found on the internet while learning how to make a volcano plot.
This code is to make volcano plots using ggplot2 and the problem I have is that I want to colour the up- and down-regulated proteins instead of colouring the proteins above the specified threshold. The code I'm using is the following:
install.packages("ggplot2")
gene_list <- read.table("/Users/Javi/Desktop/gene_list.csv", header=T, sep=",")
require(ggplot2)
##Highlight genes that have an absolute fold change > 2 and a p-value < 0.05
gene_list$threshold = as.factor(abs(gene_list$logFC) > 2 & gene_list$P.Value < 0.05)
##Construct the plot object
g = ggplot(data=gene_list, aes(x=logFC, y=-log10(P.Value), colour=my_palette)) +
geom_point(alpha=0.4, size=5) +
theme(legend.position = "none") +
xlim(c(-10, 10)) + ylim(c(0, 15)) +
xlab("log2 fold change") + ylab("-log10 p-value")
g
What I would like to do is to colour in red (for example) the logFC values > 1.3 and in blue the logFC values < -1.3
The csv file I'm using is just an example and would be something like this:
logFC P.Value
a 2 0.04
b 5 0.04
c 8 0.04
d 4 0.000005
e 7 0.01
f 1 0.04
g -6 0.0001
h -8 0.04
Thanks very much for your help in advance.
Cheers
Javi
Create a new color flag on your dataframe:
gene_list$color_flag <- ifelse(gene_list$logFC > 1.3, 1, ifelse(gene_list$logFC < -1.3, -1, 0))
Then map the flag to colour in your aes, e.g. colour = factor(color_flag); fill has no effect on the default point shape.
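Put together with the example data from the question, this could look like the sketch below. It maps the flag to colour, since geom_point's default shape ignores fill. The colours in scale_colour_manual are one possible choice, and the grey for the middle group is an assumption.

```r
library(ggplot2)

# Example values copied from the question
gene_list <- data.frame(
  logFC   = c(2, 5, 8, 4, 7, 1, -6, -8),
  P.Value = c(0.04, 0.04, 0.04, 0.000005, 0.01, 0.04, 0.0001, 0.04),
  row.names = letters[1:8]
)

# 1 = up (logFC > 1.3), -1 = down (logFC < -1.3), 0 = everything else
gene_list$color_flag <- ifelse(gene_list$logFC > 1.3, 1,
                               ifelse(gene_list$logFC < -1.3, -1, 0))

g <- ggplot(gene_list,
            aes(x = logFC, y = -log10(P.Value), colour = factor(color_flag))) +
  geom_point(alpha = 0.4, size = 5) +
  # Named values tie each flag to its colour regardless of level order
  scale_colour_manual(values = c("-1" = "blue", "0" = "grey50", "1" = "red")) +
  theme(legend.position = "none") +
  xlim(c(-10, 10)) + ylim(c(0, 15)) +
  xlab("log2 fold change") + ylab("-log10 p-value")
print(g)
```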
df = data.frame(subj=c(1,1,2,2,3,3,4,4,5,5,6,6,7,7,8,8,9,9,10,10), block=factor(rep(c(1,2),10)), acc=c(0.75,0.83,0.58,0.75,0.58,0.83,0.92,0.83,0.83,0.67,0.75,0.5,0.67,0.83,0.92,0.58,0.75,0.5,0.67,0.67))
ggplot(df,aes(block,acc,group=subj)) + geom_point(position=position_dodge(width=0.3)) + ylim(0,1) + labs(x='Block',y='Accuracy')
How do I get points to dodge each other uniformly in the horizontal direction? (I grouped by subj in order to get it to dodge at all, which might not be the correct thing to do...)
I think this might be what you were looking for, although no doubt you have solved it by now.
Hopefully it will help someone else with the same issue.
A simple way is to use geom_dotplot like this:
ggplot(df,aes(x=block,y=acc)) +
geom_dotplot(binaxis = "y", stackdir = "center", binwidth = 0.03) + ylim(0,1) + labs(x='Block',y='Accuracy')
This looks like this:
Note that x (block in this case) has to be a factor for this to work.
If they don't have to be perfectly aligned horizontally, here's one quick way of doing it, using geom_jitter. You don't need to group by subj.
Method 1 [Simpler]: Using geom_jitter()
ggplot(df,aes(x=block,y=acc)) + geom_jitter(position=position_jitter(0.05)) + ylim(0,1) + labs(x='Block',y='Accuracy')
Play with the jitter width for greater degree of jittering.
which produces:
Method 2: Deterministically calculating the jitter value for each row
We first use aggregate to count the number of duplicated entries. Then in a new data frame, for each duplicated value, move it horizontally to the left by an epsilon distance.
df$subj <- NULL #drop this so that aggregate works.
#a new data frame that shows duplicated values
agg.df <- aggregate(list(numdup=seq_len(nrow(df))), df, length)
agg.df$block <- as.numeric(agg.df$block) #convert block from a factor to numeric
# block acc numdup
#1 2 0.50 2
#2 1 0.58 2
#3 2 0.58 1
#4 1 0.67 2
#...
epsilon <- 0.02 #jitter distance
new.df <- NULL #create an expanded dataframe, with block value jittered deterministically
r <- 0
for (i in 1:nrow(agg.df)) {
for (j in 1:agg.df$numdup[i]) {
r <- r+1 #row counter in the expanded df
new.df$block[r] <- agg.df$block[i]
new.df$acc[r] <- agg.df$acc[i]
new.df$jit.value[r] <- agg.df$block[i] - (j-1)*epsilon
}
}
new.df <- as.data.frame(new.df)
ggplot(new.df,aes(x=jit.value,y=acc)) + geom_point(size=2) + ylim(0,1) + labs(x='Block',y='Accuracy') + xlim(0,3)
which produces:
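The loop in Method 2 could also be written without an explicit loop. The sketch below uses base R's ave() to number the rows that share the same block and acc, then shifts each one to the left by a multiple of epsilon, as above. It is an alternative formulation under the same assumptions, not the original answer's code.

```r
library(ggplot2)

df <- data.frame(
  block = factor(rep(c(1, 2), 10)),
  acc = c(0.75, 0.83, 0.58, 0.75, 0.58, 0.83, 0.92, 0.83, 0.83, 0.67,
          0.75, 0.5, 0.67, 0.83, 0.92, 0.58, 0.75, 0.5, 0.67, 0.67)
)
epsilon <- 0.02  # jitter distance, as in Method 2

# k = 1, 2, 3, ... numbers the rows within each (block, acc) combination
k <- ave(seq_len(nrow(df)), df$block, df$acc, FUN = seq_along)

# The first point of a combination stays on the block position,
# each further duplicate moves one epsilon to the left
df$jit.value <- as.numeric(df$block) - (k - 1) * epsilon

p <- ggplot(df, aes(x = jit.value, y = acc)) +
  geom_point(size = 2) +
  ylim(0, 1) + xlim(0, 3) +
  labs(x = "Block", y = "Accuracy")
print(p)
```

To center the duplicates on the block position instead of stacking them to the left, subtract the group mean of (k - 1) * epsilon within each combination.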
Here is the updated example:
df <- data.frame(a=rep(c("A","B"),each=10),
b=rep(rep(c("C","D"),each=5),2),
c=c(sample(letters[1:5]), sample(letters[6:10]),
sample(letters[1:5]), sample(letters[6:10])),
d=c(0.10,0.18,0.34,0.35,0.59,0.16,0.38,0.40,0.53,0.58,
0.37,0.62,0.83,1.46,-0.91,-0.79,-0.52,-0.43,-0.01,0.34))
> df
a b c d
1 A C b 0.10
2 A C e 0.18
3 A C a 0.34
4 A C c 0.35
5 A C d 0.59
6 A D i 0.16
7 A D j 0.38
8 A D h 0.40
9 A D f 0.53
10 A D g 0.58
11 B C e 0.37
12 B C d 0.62
13 B C a 0.83
14 B C c 1.46
15 B C b -0.91
16 B D f -0.79
17 B D i -0.52
18 B D h -0.43
19 B D j -0.01
20 B D g 0.34
If you look closely, you will see that column d is ordered within column b always from smallest to largest.
The first plot is how I would like the plot to look, except that the bars are not displayed in the order of d, so they do not appear from smallest to largest:
p <- ggplot(df, aes(x=c, y=d, fill=b)) +
facet_grid(. ~ a) +
geom_bar(stat="identity")
print(p)
This is because column c is a factor whose levels are not in the same order as column d. So I did the following:
df$c <- paste(1:nrow(df), df$c, sep="_")
df$c <- factor(df$c, levels = df$c)
p <- ggplot(df, aes(x=c, y=d, fill=b)) +
facet_grid(. ~ a) +
geom_bar(stat="identity")
print(p)
produces the following plot:
Here the order is correct. However, because I created unique factor levels, I get empty spaces for the levels that are not present in A and B respectively.
How can I sort that out?
Now that you have changed the question, 'ggplot' cannot do this for you. By giving [df$c] levels, you could order the data but only based on the first set of [c] values. For instance:
df$c <- factor(df$c, levels=levels(df$c)[order(df$d)])
But that won't work, since you're trying to sort [df$c] twice (once for "A", and once for "B").
You really need to break this into two separate plots, and just plot the two viewports next to each other.
Setting up the viewports:
library(grid) # provides grid.newpage(), pushViewport() and viewport()
grid.newpage()
pushViewport(viewport(layout = grid.layout(1, 2)))
Plot A:
a_df <- df[df$a=="A",]
a_df$c <- factor(a_df$c, levels=levels(a_df$c)[order(a_df$d)])
a_p <- ggplot(a_df, aes(x=1:10, y=d, fill=b)) +
facet_grid(. ~ a) +
geom_bar(stat="identity", position="dodge")
print(a_p, vp = viewport(layout.pos.row=1, layout.pos.col=1))
Plot B:
b_df <- df[df$a=="B",]
b_df$c <- factor(b_df$c, levels=levels(b_df$c)[order(b_df$d)])
b_p <- ggplot(b_df, aes(x=1:10, y=d, fill=b)) +
facet_grid(. ~ a) +
geom_bar(stat="identity", position="dodge")
print(b_p, vp = viewport(layout.pos.row=1, layout.pos.col=2))
From here, you can worry about removing the excess legend, choosing which axes to label and such, but it looks exactly like your example plot only with the empty locations removed.
This is really an example of how 'ggplot' is sometimes more of a hindrance and less of a boon. In my experience, it's best to first design your plot and then choose the tool. Frequently, I find myself going back to raw 'grid' to do my visuals, because I want something the 'grid' wrapper 'ggplot' just won't do.
Note: In the future, don't delete your original question content; just add the updated info. Removing the old content makes a lot of the answers and comments on this page irrelevant.
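Depending on your ggplot2 version, a single-plot alternative may also work: facet_grid() accepts scales = "free_x", which lets each panel keep only the factor levels that occur in it. The sketch below reuses the data from the question. The set.seed(42) call and the label-stripping function are additions for illustration.

```r
library(ggplot2)

set.seed(42)  # added so that sample() is reproducible
df <- data.frame(a = rep(c("A", "B"), each = 10),
                 b = rep(rep(c("C", "D"), each = 5), 2),
                 c = c(sample(letters[1:5]), sample(letters[6:10]),
                       sample(letters[1:5]), sample(letters[6:10])),
                 d = c(0.10, 0.18, 0.34, 0.35, 0.59, 0.16, 0.38, 0.40, 0.53, 0.58,
                       0.37, 0.62, 0.83, 1.46, -0.91, -0.79, -0.52, -0.43, -0.01, 0.34))

# Unique labels in row order, as in the question
df$c <- paste(seq_len(nrow(df)), df$c, sep = "_")
df$c <- factor(df$c, levels = df$c)

p <- ggplot(df, aes(x = c, y = d, fill = b)) +
  geom_col() +
  # free_x drops the levels that are absent from a panel
  facet_grid(. ~ a, scales = "free_x") +
  # show the original letters instead of labels such as "1_b"
  scale_x_discrete(labels = function(x) sub("^[0-9]+_", "", x))
print(p)
```

The viewport approach above remains useful when the two panels need fully independent layouts.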
I think this is actually a common mistake with the 'ggplot' function. If you set an outline color (i.e. aes(colour="red")), you will see that you are actually plotting all four values, but they are plotting on top of each other. The stacking warning is because the default value of 'position' is "stack". Just include the position="dodge" argument, and that will go away.
Now, to actually solve your problem. You need to give 'ggplot' something to distinguish between the values of X(A), X(B), Y(A), and Y(B). At first glance, you might be tempted to use your [b] values, but you don't want all of the extra spaces. Let's adjust your dataframe to have only 1s and 2s for [b]:
df <- data.frame(a=rep(rep(c("A","B"),each=2),2),
b=rep(1:2,4),
c=rep(c("X","Y"),each=4),
d=c(1.2,1.1,1.15,1.1, -1.1,-1.05,-1.2,-1.08))
The plot is actually pretty easy to fix once you know the problem. First, set [b] to your x-axis, and add [a] to your facet. Then remove all of the annoying gibberish from [b] using the 'theme' with blank elements:
p <- ggplot(NULL, aes(x=b, y=d)) +
facet_grid(. ~ c + a) +
geom_bar(data = df, stat="identity", position="dodge") +
theme(axis.ticks = element_blank(), axis.text.x = element_blank(), axis.title.x = element_blank())
print(p)
If this isn't exactly what you want, it should be at least close enough that you'll only have to do cosmetic changes. Good luck!