Troubles with ggplot and geom_bar


Here is the updated example:
df <- data.frame(a=rep(c("A","B"),each=10),
b=rep(rep(c("C","D"),each=5),2),
c=c(sample(letters[1:5]), sample(letters[6:10]),
sample(letters[1:5]), sample(letters[6:10])),
d=c(0.10,0.18,0.34,0.35,0.59,0.16,0.38,0.40,0.53,0.58,
0.37,0.62,0.83,1.46,-0.91,-0.79,-0.52,-0.43,-0.01,0.34))
> df
a b c d
1 A C b 0.10
2 A C e 0.18
3 A C a 0.34
4 A C c 0.35
5 A C d 0.59
6 A D i 0.16
7 A D j 0.38
8 A D h 0.40
9 A D f 0.53
10 A D g 0.58
11 B C e 0.37
12 B C d 0.62
13 B C a 0.83
14 B C c 1.46
15 B C b -0.91
16 B D f -0.79
17 B D i -0.52
18 B D h -0.43
19 B D j -0.01
20 B D g 0.34
If you look closely, you will see that within each level of column b, column d is ordered from smallest to largest.
The first plot is what I would like to have, except that the bars are not displayed in the order of d, so they do not appear from smallest to largest:
library(ggplot2)
# stat is an argument of geom_bar(), not an aesthetic
p <- ggplot(df, aes(x=c, y=d, fill=b)) +
  facet_grid(. ~ a) +
  geom_bar(stat="identity")
print(p)
This is because column c is a factor and its levels are apparently not in the same order as the values in column d. So I did the following:
# prefix each value with its row number so that every level is unique
df$c <- paste(1:nrow(df), df$c, sep="_")
# keep the levels in row order
df$c <- factor(df$c, levels = df$c)
p <- ggplot(df, aes(x=c, y=d, fill=b)) +
  facet_grid(. ~ a) +
  geom_bar(stat="identity")
print(p)
This produces a plot in which the order is correct. However, because I created unique factor levels, each facet leaves empty spaces for the levels that are present only in the other facet (A or B respectively).
How can I get rid of those gaps?

Now that you have changed the question, 'ggplot' cannot do this for you. By giving [df$c] levels, you could order the data but only based on the first set of [c] values. For instance:
df$c <- factor(df$c, levels=levels(df$c)[order(df$d)])
But that won't work, since you're trying to sort [df$c] twice (once for "A", and once for "B").
You really need to break this into two separate plots, and just plot the two viewports next to each other.
Setting up the viewports:
library(grid)    # grid.newpage(), pushViewport(), viewport()
library(ggplot2)
grid.newpage()
pushViewport(viewport(layout = grid.layout(1, 2)))
Plot A:
a_df <- df[df$a=="A",]
a_df$c <- factor(a_df$c, levels=as.character(a_df$c)[order(a_df$d)]) # levels sorted by d
a_p <- ggplot(a_df, aes(x=1:10, y=d, fill=b)) +
facet_grid(. ~ a) +
geom_bar(stat="identity", position="dodge")
print(a_p, vp = viewport(layout.pos.row=1, layout.pos.col=1))
Plot B:
b_df <- df[df$a=="B",]
b_df$c <- factor(b_df$c, levels=as.character(b_df$c)[order(b_df$d)]) # levels sorted by d
b_p <- ggplot(b_df, aes(x=1:10, y=d, fill=b)) +
facet_grid(. ~ a) +
geom_bar(stat="identity", position="dodge")
print(b_p, vp = viewport(layout.pos.row=1, layout.pos.col=2))
From here, you can worry about removing the excess legend, choosing which axes to label and such, but it looks exactly like your example plot only with the empty locations removed.
This is really an example of how 'ggplot' is sometimes more of a hindrance and less of a boon. In my experience, it's best to first design your plot and then choose the tool. Frequently, I find myself going back to raw 'grid' to do my visuals, because I want something the 'grid' wrapper 'ggplot' just won't do.
Note: In the future, don't delete your original question content; just add the updated info. Removing the old content makes a lot of the answers and comments on this page irrelevant.
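A possible alternative to the two-viewport layout, offered as a sketch rather than a definitive fix: keep the unique, row-ordered factor levels from the question and let each facet drop the levels it does not use via scales = "free_x". The set.seed() call and the helper name strip_prefix are assumptions added for illustration; they are not part of the original question or answer.

```r
library(ggplot2)

set.seed(1)  # assumption: only here so the sampled letters are reproducible
df <- data.frame(a = rep(c("A", "B"), each = 10),
                 b = rep(rep(c("C", "D"), each = 5), 2),
                 c = c(sample(letters[1:5]), sample(letters[6:10]),
                       sample(letters[1:5]), sample(letters[6:10])),
                 d = c(0.10, 0.18, 0.34, 0.35, 0.59, 0.16, 0.38, 0.40, 0.53, 0.58,
                       0.37, 0.62, 0.83, 1.46, -0.91, -0.79, -0.52, -0.43, -0.01, 0.34))

# Make every row its own factor level, in row order (as in the question).
df$c <- paste(seq_len(nrow(df)), df$c, sep = "_")
df$c <- factor(df$c, levels = df$c)

# Hypothetical helper: strip the "<row>_" prefix so the axis shows the letters only.
strip_prefix <- function(x) sub("^[0-9]+_", "", x)

p <- ggplot(df, aes(x = c, y = d, fill = b)) +
  geom_bar(stat = "identity") +
  # free_x lets each panel keep only the levels that occur in it
  facet_grid(. ~ a, scales = "free_x") +
  scale_x_discrete(labels = strip_prefix)
print(p)
```

If this behaves as expected, both panels share one legend and one y-axis, so the cosmetic clean-up needed for the viewport approach may not be necessary.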

I think this is actually a common mistake with the 'ggplot' function. If you set an outline color (i.e. aes(colour="red")), you will see that you are actually plotting all four values, but they are plotting on top of each other. The stacking warning is because the default value of 'position' is "stack". Just include the position="dodge" argument, and that will go away.
Now, to actually solve your problem. You need to give 'ggplot' something to distinguish between the values of X(A), X(B), Y(A), and Y(B). At first glance, you might be tempted to use your [b] values, but you don't want all of the extra spaces. Let's adjust your dataframe to have only 1s and 2s for [b]:
df <- data.frame(a=rep(rep(c("A","B"),each=2),2),
b=rep(1:2,4),
c=rep(c("X","Y"),each=4),
d=c(1.2,1.1,1.15,1.1, -1.1,-1.05,-1.2,-1.08))
The plot is actually pretty easy to fix once you know the problem. First, set [b] to your x-axis, and add [a] to your facet. Then remove all of the annoying gibberish from [b] using the 'theme' with blank elements:
p <- ggplot(NULL, aes(x=b, y=d)) +
facet_grid(. ~ c + a) +
geom_bar(data = df, stat="identity", position="dodge") +
theme(axis.ticks = element_blank(), axis.text.x = element_blank(), axis.title.x = element_blank())
print(p)
If this isn't exactly what you want, it should be at least close enough that you'll only have to do cosmetic changes. Good luck!

Related

Ggplot does not label all interesting peptides for volcano plot

I have a data frame containing 2479 peptides with their sequence, p-value and log2 fold change.
# A tibble: 6 x 3
Sequence p log2fold
<chr> <dbl> <dbl>
1 FLENEDR 0.343 1.21
2 DTEEEDFHVDQATTVK 0.270 0.771
3 DTEEEDFHVDQATTVK 0.112 1.18
4 SCRASQSVSSSF 0.798 0.139
5 RLSCTTSGF 0.739 0.110
6 SCRASQSVSSSY 0.209 0.375
I'm trying to make a volcano plot that labels the up- and downregulated peptides. However, for some reason, ggplot only draws 6 labels, and I have no idea why.
I have tried loads of different things. I tried using the up/down regulation in the expression column, and I tried increasing and decreasing my cut-off values to check whether they were the problem. I used ggrepel to try to spread the labels out more. Nothing seems to work. My latest attempt is the code below.
As a last resort, I made a new group column and kept only the peptides that pass both the significance and the fold-change cut-off, which leaves 39 peptides. I then used this filtered data frame as the label source, matching peptides between the two data frames.
Another problem is that a character has appeared in my legend since I started using geom_text_repel. I have no idea how or why this is happening.
library(ggplot2)
library(ggrepel)
library(tidyverse)
Volc <- R_volcano
expression <- ifelse(Volc$p < 0.05 & abs (Volc$log2fold) >=1, ifelse(Volc$log2fold>1, 'up', 'down'), 'stable')
Volc <- cbind(Volc, expression)
colnames(Volc)[1] <- 'Sequencenames'
Volc["group"] <- "NotSignificant"
Volc[which(Volc['p'] < 0.05 & abs(Volc['log2fold']) < 1 ),"group"] <- "Significant"
Volc[which(Volc['p'] > 0.05 & abs(Volc['log2fold']) > 1 ),"group"] <- "FoldChange"
Volc[which(Volc['p'] < 0.05 & abs(Volc['log2fold']) > 1 ),"group"] <- "Significant&FoldChange"
VolcFilter <- Volc %>% filter(group=="Significant&FoldChange")
p <- ggplot(data = Volc, aes(x = log2fold, y = -log10(p), colour=expression, label='Sequencenames')) +
geom_point(alpha=0.4, size=2) +
scale_color_manual(values=c("blue", "grey","red"))+
xlim(c(-4.5, 4.5)) +
geom_vline(xintercept=c(-1,1),lty=4,col="black",lwd=0.8) +
geom_hline(yintercept = 1.301,lty=4,col="black",lwd=0.8) +
geom_text_repel(data=head(VolcFilter), aes(label=Sequencenames))+
labs(x="log2(fold change)",
y="-log10 (p-value)",
title="Differential expression") +
theme_bw()+
theme(plot.title = element_text(hjust = 0.5),
legend.position="right",
legend.title = element_blank())
p
Any help is much appreciated. Fairly new to R.
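A likely explanation, sketched on toy data rather than the real R_volcano table: head() returns only the first six rows by default, so geom_text_repel(data=head(VolcFilter), ...) can never draw more than six labels. The text glyph in the legend comes from the label layer contributing a legend key, which show.legend = FALSE suppresses. The toy data frame, the peptide names and the choice of max.overlaps = Inf are assumptions for illustration only.

```r
library(ggplot2)
library(ggrepel)

# Toy stand-in for R_volcano (assumption: same three columns as the real data).
set.seed(42)
Volc <- data.frame(Sequencenames = paste0("PEP", 1:200),
                   p = runif(200),
                   log2fold = rnorm(200, sd = 1.5))
# Force the first 10 rows to pass both cut-offs so the example has labels.
Volc$p[1:10] <- 0.001
Volc$log2fold[1:10] <- c(2, -2, 2.5, -2.5, 3, -3, 1.5, -1.5, 3.5, -3.5)

Volc$expression <- ifelse(Volc$p < 0.05 & abs(Volc$log2fold) >= 1,
                          ifelse(Volc$log2fold > 1, "up", "down"), "stable")
VolcFilter <- Volc[Volc$expression != "stable", ]

nrow(head(VolcFilter))  # head() keeps at most 6 rows by default
nrow(VolcFilter)        # every peptide that passes both cut-offs

p <- ggplot(Volc, aes(x = log2fold, y = -log10(p), colour = expression)) +
  geom_point(alpha = 0.4, size = 2) +
  scale_color_manual(values = c(down = "blue", stable = "grey", up = "red")) +
  geom_vline(xintercept = c(-1, 1), linetype = 4) +
  geom_hline(yintercept = -log10(0.05), linetype = 4) +
  # label all filtered peptides; keep the text glyph out of the legend
  geom_text_repel(data = VolcFilter, aes(label = Sequencenames),
                  show.legend = FALSE, max.overlaps = Inf) +
  theme_bw()
print(p)
```

Note that label='Sequencenames' in the top-level aes() of the original code is a quoted string, so it maps the literal text rather than the column; mapping the label only inside the text layer, as above, avoids that.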

R ggplot2: colors for stat_ellipse and for categorical variables with several unique values

I have an R dataframe similar to the following:
group x y trait1 grp.x grp.y
ind1 3 -1.35155328 2.5388350 A -1.1778170 2.2361359
ind2 3 -1.38344150 2.1475588 B -1.1778170 2.2361359
ind3 3 -1.38859652 2.4959691 B -1.1778170 2.2361359
ind4 2 -0.09222147 -1.6956082 A -0.8312698 -1.1864784
ind5 2 -0.51072944 -0.4015302 B -0.8312698 -1.1864784
ind6 3 -1.33953852 2.3398774 B -1.1778170 2.2361359
ind7 3 -1.33566078 1.7296098 B -1.1778170 2.2361359
ind8 2 -0.58546568 -0.6147354 C -0.8312698 -1.1864784
ind9 2 -0.76524417 -0.9873662 C -0.8312698 -1.1864784
ind10 2 -0.01614503 -1.4883271 C -0.8312698 -1.1864784
ind11 1 7.37664013 -1.5121731 D 7.5796202 -0.7459455
ind12 1 5.69439899 0.1074283 D 7.5796202 -0.7459455
ind13 1 6.83721986 -0.9119275 D 7.5796202 -0.7459455
ind14 1 9.66081076 -1.7497733 D 7.5796202 -0.7459455
ind15 1 7.31749818 -1.3984597 E 7.5796202 -0.7459455
This dataframe contains the results of a cluster analysis, and each row corresponds with an individual sample. For each individual, I have values for group assignment (group), xy coordinates for the individual (x, y), values for a particular trait (trait1), and xy coordinates for the centroid of each group (grp.x, grp.y).
I've used the following code to create the below plot with ggplot2:
ggplot(data) +
geom_point(aes(x=x, y=y, color=trait), size=1.5) +
stat_ellipse(aes(x=x, y=y, color=group), type="norm", level=0.6) +
geom_point(mapping = aes(x=grp.x, y=grp.y), shape=17, cex=0.75) +
geom_segment(aes(x=grp.x, y=grp.y, xend=x, yend=y), lwd=0.25) +
geom_label(aes(x=grp.x, y=grp.y, label=group))
There are two things I would like to change about this plot but am stuck:
1. Since the number corresponding with each group is shown as a label (e.g. 1, 2, 3), I don't really need the ellipse corresponding with each group to be color coded by group. Is there a way to use a single color for the ellipses described by stat_ellipse?
2. My real data have many unique values for trait1 (rather than the 5 unique values shown here), and I would therefore like to create a palette of diverging colors to use for the fill of individual points, but I'm not sure that I am creating a palette properly, and where this should be implemented in my code (particularly given the context of question 1).
I have attempted the following:
cols = colorRampPalette(brewer.pal(11,"Spectral"))(65)
ggplot(data) +
geom_point(aes(x=x, y=y, color=trait), size=1.5) +
stat_ellipse(aes(x=x, y=y, color=group), type="norm", level=0.6) +
geom_point(mapping = aes(x=grp.x, y=grp.y), shape=17, cex=0.75) +
geom_segment(aes(x=grp.x, y=grp.y, xend=x, yend=y), lwd=0.25) +
geom_label(aes(x=grp.x, y=grp.y, label=group)) +
scale_color_brewer(palette=cols)
However, I get an error message for unknown palette type (I must be creating and/or calling my palette incorrectly), and the default color brewer palette R reverts to does not contain enough unique colors.
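One way both points could be addressed, shown as a sketch on made-up data: a constant colour placed outside aes() applies to every ellipse, while group = group still produces one ellipse per cluster; and a vector of colours built with colorRampPalette() is passed to scale_color_manual(values = ...), since scale_color_brewer() expects the name of a palette rather than a colour vector. The toy data frame, the grey ellipse colour and the variable n_traits are assumptions for illustration.

```r
library(ggplot2)
library(RColorBrewer)

# Toy stand-in for the clustered data (assumption: same column names as the table above).
set.seed(1)
data <- data.frame(group = rep(1:3, each = 20),
                   trait1 = sample(LETTERS[1:8], 60, replace = TRUE))
data$x <- rnorm(60, mean = c(7, -1, -1)[data$group])
data$y <- rnorm(60, mean = c(-1, -1, 2)[data$group])
data$grp.x <- ave(data$x, data$group)  # group centroids
data$grp.y <- ave(data$y, data$group)

# One colour per distinct trait value, interpolated from the Spectral palette.
n_traits <- length(unique(data$trait1))
cols <- colorRampPalette(brewer.pal(11, "Spectral"))(n_traits)

p <- ggplot(data, aes(x = x, y = y)) +
  geom_point(aes(color = trait1), size = 1.5) +
  # colour is set outside aes(), so all ellipses share it
  stat_ellipse(aes(group = group), type = "norm", level = 0.6, color = "grey30") +
  geom_segment(aes(x = grp.x, y = grp.y, xend = x, yend = y), lwd = 0.25) +
  geom_label(aes(x = grp.x, y = grp.y, label = group)) +
  scale_color_manual(values = cols)
print(p)
```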

geom_point points manual scaling

I have some data (named result.df) that looks like the following:
orgaName abundance pVal score
A 3 9.998622e-01 1.795338e-04
B 2 9.999790e-01 1.823428e-05
C 1 2.225074e-308 3.076527e+02
D 1 3.510957e-01 4.545745e-01
and so on...
What I am now plotting is this:
p1 <- ggplot(result.df, aes(log2(abundance), (1-pVal), label=orgaName)) +
ylab("1 - P-Value")+
xlab("log2(abundance)")+
geom_point(aes(size=score))+
ggtitle(colnames(case.count.matrix)[i])+
geom_text(data=subset(result.df, pVal < 0.05),hjust=.65, vjust=-1.2,size=2.5)+
geom_hline(aes(yintercept=.95), colour="blue", linetype="dashed")+
theme_classic()
Everything works and looks fine. However, I would like the point size introduced through
geom_point(aes(size=score))+
to be scaled against fixed values. The legend should be scaled logarithmically (base 10) while the scores themselves stay the same, so that low scores nearly disappear and large scores have comparable point sizes across different "result.df" data frames.
EDIT
After checking the comments from @roman and @vrajs5, I was able to produce a plot using the following code:
ggplot(result.df, aes(log2(abundance), (1-pVal), label=orgaName)) +
ylab("1 - P-Value")+
xlab("log2(abundance)")+
geom_point(aes(size=score))+
ggtitle(colnames(case.count.matrix)[i])+
#geom_text(data=subset(result.df, pVal < 0.05 & log2(abundance) > xInt),hjust=.65, vjust=-1.2,size=2.5)+
geom_text(data=subset(result.df, pVal < 0.05),hjust=.65, vjust=-1.2,size=2.5)+
geom_hline(aes(yintercept=.95), colour="blue", linetype="dashed")+
#geom_vline(aes(xintercept=xInt), colour="blue", linetype="dashed")+
#geom_text(data=subset(result.df, pVal > 0.05 & log2(abundance) > xInt),alpha=.5,hjust=.65, vjust=-1.2,size=2)+
#geom_text(data=subset(result.df, pVal < 0.05 & log2(abundance) < xInt),alpha=.5,hjust=.65, vjust=-1.2,size=2)+
theme_classic() +
scale_size(range=c(2,12),expand=c(2,0),breaks=c(0,1,10,100,1000,1000000),labels=c(">=0",">=1",">=10",">=100",">=1000",">=1000000"),guide="legend")
As you can see, the breaks are introduced and labeled as intended. However, the point sizes in the legend do not reflect the point sizes in the plot. Any idea how to fix this?

As @Roman mentioned, if you use scale_size you can specify the limits on the point size.
The following example shows how to control the size of the points:
result.df = read.table(text = 'orgaName abundance pVal score
A 3 9.998622e-01 1.795338e-04
B 2 9.999790e-01 1.823428e-05
C 1 2.225074e-308 3.076527e+02
D 1 3.510957e-01 4.545745e-01
E 3 2.510957e-01 2.545745e+00
F 3 1.510957e-02 2.006527e+02
G 2 5.510957e-01 3.545745e-02', header = T)
library(ggplot2)
ggplot(result.df, aes(log2(abundance), (1-pVal), label=orgaName)) +
ylab("1 - P-Value")+
xlab("log2(abundance)")+
geom_point(aes(size=score))+
#ggtitle(colnames(case.count.matrix)[i])+
geom_text(data=subset(result.df, pVal < 0.05),hjust=.65, vjust=-1.2,size=2.5)+
geom_hline(aes(yintercept=.95), colour="blue", linetype="dashed")+
theme_classic() +
scale_size(range = c(2,12))
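To make point sizes comparable between different data frames and to get a log-scaled legend, one option is to map log10(score) to size and fix the scale limits, so the same score always gets the same point size. This is a sketch under assumptions: the limits c(-5, 3) and the breaks are guesses based on the sample values and would need adjusting to the real score range; scores outside the limits are dropped with a warning.

```r
library(ggplot2)

result.df <- read.table(text = "orgaName abundance pVal score
A 3 9.998622e-01 1.795338e-04
B 2 9.999790e-01 1.823428e-05
C 1 2.225074e-308 3.076527e+02
D 1 3.510957e-01 4.545745e-01", header = TRUE)

# Assumed fixed range of log10(score), shared by every plot you want to compare.
size_limits <- c(-5, 3)
size_breaks <- c(-4, -2, 0, 2)

p <- ggplot(result.df, aes(log2(abundance), 1 - pVal, label = orgaName)) +
  geom_point(aes(size = log10(score))) +
  scale_size(name = "score",
             limits = size_limits,     # identical limits give identical sizes across plots
             range = c(0.5, 12),       # point sizes at the lower and upper limit
             breaks = size_breaks,
             labels = 10^size_breaks) +  # show the legend in original score units
  theme_classic()
print(p)
```

Because the legend is drawn from the same scale as the points, the legend keys and the plotted points should agree in size.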

Colour regulated genes in Volcano plot - ggplot2

Hi, I'm very new to R and I'm struggling to modify some R code that I found on the internet while learning how to make a volcano plot.
The code makes volcano plots using ggplot2. My problem is that I want to colour the up- and down-regulated proteins instead of colouring the proteins above the specified threshold. The code I'm using is the following:
install.packages("ggplot2")
gene_list <- read.table("/Users/Javi/Desktop/gene_list.csv", header=T, sep=",")
require(ggplot2)
##Highlight genes that have an absolute fold change > 2 and a p-value < 0.05
gene_list$threshold = as.factor(abs(gene_list$logFC) > 2 & gene_list$P.Value < 0.05)
##Construct the plot object
g = ggplot(data=gene_list, aes(x=logFC, y=-log10(P.Value), colour=my_palette)) +
geom_point(alpha=0.4, size=5) +
theme(legend.position = "none") +
xlim(c(-10, 10)) + ylim(c(0, 15)) +
xlab("log2 fold change") + ylab("-log10 p-value")
g
What I would like to do is to colour in red (for example) the logFC values > 1.3 and in blue the logFC values < -1.3
The csv file I'm using is just an example and would be something like this:
logFC P.Value
a 2 0.04
b 5 0.04
c 8 0.04
d 4 0.000005
e 7 0.01
f 1 0.04
g -6 0.0001
h -8 0.04
Thanks very much for your help in advance.
Cheers
Javi

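Create a new colour flag in your data frame:
gene_list$color_flag <- ifelse(gene_list$logFC > 1.3, 1, ifelse(gene_list$logFC < -1.3, -1, 0))
Then map it in your aes with colour = factor(color_flag). The default point shape ignores fill, and wrapping the flag in factor() gives three discrete colours instead of a continuous gradient.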
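Put together, the idea might look like the sketch below. It uses the example values from the question; the column name regulation, the text labels "up"/"down"/"none" and the grey colour for unregulated points are assumptions chosen for readability.

```r
library(ggplot2)

gene_list <- data.frame(logFC = c(2, 5, 8, 4, 7, 1, -6, -8),
                        P.Value = c(0.04, 0.04, 0.04, 0.000005, 0.01, 0.04, 0.0001, 0.04),
                        row.names = letters[1:8])

# Three-way flag: "up" above 1.3, "down" below -1.3, "none" otherwise.
gene_list$regulation <- ifelse(gene_list$logFC > 1.3, "up",
                        ifelse(gene_list$logFC < -1.3, "down", "none"))

g <- ggplot(gene_list, aes(x = logFC, y = -log10(P.Value), colour = regulation)) +
  geom_point(alpha = 0.4, size = 5) +
  # named vector: each flag value is tied to an explicit colour
  scale_colour_manual(values = c(up = "red", down = "blue", none = "grey50")) +
  theme(legend.position = "none") +
  xlim(c(-10, 10)) + ylim(c(0, 15)) +
  xlab("log2 fold change") + ylab("-log10 p-value")
print(g)
```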

How to dodge points in ggplot2 in R

df = data.frame(subj=c(1,1,2,2,3,3,4,4,5,5,6,6,7,7,8,8,9,9,10,10), block=factor(rep(c(1,2),10)), acc=c(0.75,0.83,0.58,0.75,0.58,0.83,0.92,0.83,0.83,0.67,0.75,0.5,0.67,0.83,0.92,0.58,0.75,0.5,0.67,0.67))
ggplot(df,aes(block,acc,group=subj)) + geom_point(position=position_dodge(width=0.3)) + ylim(0,1) + labs(x='Block',y='Accuracy')
How do I get points to dodge each other uniformly in the horizontal direction? (I grouped by subj in order to get it to dodge at all, which might not be the correct thing to do...)

I think this might be what you were looking for, although no doubt you have solved it by now.
Hopefully it will help someone else with the same issue.
A simple way is to use geom_dotplot like this:
ggplot(df,aes(x=block,y=acc)) +
geom_dotplot(binaxis = "y", stackdir = "center", binwidth = 0.03) + ylim(0,1) + labs(x='Block',y='Accuracy')
Note that x (block in this case) has to be a factor for this to work.

If they don't have to be perfectly aligned horizontally, here's one quick way of doing it, using geom_jitter. You don't need to group by subj.
Method 1 [Simpler]: Using geom_jitter()
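ggplot(df,aes(x=block,y=acc)) + geom_jitter(position=position_jitter(width=0.05, height=0)) + ylim(0,1) + labs(x='Block',y='Accuracy')
Increase the jitter width for a greater degree of jittering. Setting height=0 jitters the points horizontally only, so the accuracy values themselves are not changed.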
Method 2: Deterministically calculating the jitter value for each row
We first use aggregate to count the number of duplicated entries. Then in a new data frame, for each duplicated value, move it horizontally to the left by an epsilon distance.
df$subj <- NULL #drop this so that aggregate works.
#a new data frame that shows duplicated values
agg.df <- aggregate(list(numdup=seq_len(nrow(df))), df, length)
agg.df$block <- as.numeric(agg.df$block) #convert block from a factor to numeric
# block acc numdup
#1 2 0.50 2
#2 1 0.58 2
#3 2 0.58 1
#4 1 0.67 2
#...
epsilon <- 0.02 #jitter distance
new.df <- NULL #create an expanded dataframe, with block value jittered deterministically
r <- 0
for (i in 1:nrow(agg.df)) {
for (j in 1:agg.df$numdup[i]) {
r <- r+1 #row counter in the expanded df
new.df$block[r] <- agg.df$block[i]
new.df$acc[r] <- agg.df$acc[i]
new.df$jit.value[r] <- agg.df$block[i] - (j-1)*epsilon
}
}
new.df <- as.data.frame(new.df)
ggplot(new.df,aes(x=jit.value,y=acc)) + geom_point(size=2) + ylim(0,1) + labs(x='Block',y='Accuracy') + xlim(0,3)
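The same deterministic offset can probably be computed without the nested loops. The sketch below uses ave() to number the duplicates within each (block, acc) combination and shifts the k-th duplicate left by (k - 1) * epsilon. The variable name dup_index is an assumption; the data and epsilon are taken from above.

```r
library(ggplot2)

df <- data.frame(block = factor(rep(c(1, 2), 10)),
                 acc = c(0.75, 0.83, 0.58, 0.75, 0.58, 0.83, 0.92, 0.83, 0.83, 0.67,
                         0.75, 0.5, 0.67, 0.83, 0.92, 0.58, 0.75, 0.5, 0.67, 0.67))
epsilon <- 0.02  # horizontal offset between duplicated points

# Running index (1, 2, 3, ...) of each row within its (block, acc) group.
dup_index <- ave(df$acc, df$block, df$acc, FUN = seq_along)

# Shift the k-th duplicate to the left by (k - 1) * epsilon.
df$jit.value <- as.numeric(df$block) - (dup_index - 1) * epsilon

p <- ggplot(df, aes(x = jit.value, y = acc)) +
  geom_point(size = 2) +
  ylim(0, 1) + xlim(0, 3) +
  labs(x = "Block", y = "Accuracy")
print(p)
```

Unlike the loop version, this keeps the original row order and any other columns of df, which may be convenient if the points need to be coloured or labelled later.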
