Is there a way to change the shape of the points for missing data in R? I am plotting .csv files like this one in a lollipop style.
Name,chr,Pos,Reads...ME_016,Reads...ME_017,Reads...ME_018,Reads...ME_019
cg01389728,chr10,6620395,33.82,41.38,41.38,38.46
cg01389728,chr10,6620410,0,-,-,-
cg01389728,chr10,6620430,0,0,-,-
cg01389728,chr10,6620447,0,-,0,-
cg01389728,chr10,6620478,0,-,-,-
cg01389728,chr10,6620510,28.33,29.85,25.64,28.13
cg01389728,chr10,6620520,0,0,-,0
cg01389728,chr10,6620531,0,-,50,-
Using ggplot2, my graphs are created with this:
library(ggplot2)
library(reshape2) # provides melt()
dataset <- read.table("testset", sep=",", na.strings="-", header=TRUE)
dataset <- subset(dataset, select=c(-Name, -chr))
dataset <- melt(dataset, id.vars="Pos")
dataset$variable <- gsub("\\.\\.\\.","_",dataset$variable)
xaxes <- unique(dataset$Pos)
dataset$Pos <- as.factor(dataset$Pos)
ggplot(dataset, aes(x=Pos, y=variable,fill=cut(value, breaks=10))) + geom_point(size=4, shape=21) + geom_line() + scale_fill_discrete(labels=c("0-10%","10-20%","20-30%","30-40%","40-50%","50-60%","60-70%","70-80%","80-90%","90-100%")) +
xlab("CpG Positions") +
ylab("Sample") +
labs(fill="Coverage in %") +
theme_bw() +
theme(axis.text.x = element_text(angle=90, hjust=1, vjust=0.5),plot.title = element_text(vjust=2),axis.title.x = element_text(vjust=-0.5),axis.title.y = element_text(vjust=1.5))
However, I want to draw the missing points ("-") as an "x" (shape=4) and also show them in the legend.
I've tried approaches like:
scale_fill_manual(values=c(value, NA))
or:
scale_shape_manual(values=c(21,4))
By default, the "-" values are also drawn with shape 21, in grey. There must be a way to change this. A function like the following might do the trick, but how do I call it for the whole column?
formas <- function(x){
  if(is.na(x)) forma <- 4
  if(!is.na(x)) forma <- 21
  return(forma)
}
This comes pretty close, I think.
ggplot(dataset, aes(x=Pos, y=variable,
color=cut(value, breaks=10),
shape=ifelse(is.na(value),"Missing","Present"))) +
geom_point(size=4) +
geom_line() +
scale_shape_manual(name="",values=c(Missing=4,Present=19))+
scale_color_discrete(labels=c("0-10%","10-20%","20-30%","30-40%","40-50%","50-60%","60-70%","70-80%","80-90%","90-100%")) +
xlab("CpG Positions") +
ylab("Sample") +
labs(color="Coverage in %") +
theme_bw() +
theme(axis.text.x = element_text(angle=90, hjust=1, vjust=0.5),plot.title = element_text(vjust=2),axis.title.x = element_text(vjust=-0.5),axis.title.y = element_text(vjust=1.5))
Changes are:
used color instead of fill, with shape=19 for points with data
added shape aesthetic to ggplot(...) call.
removed shape=21 from geom_point(...) call.
added scale_shape_manual(...) to define the shapes for Missing and Present, and turn off the guide label.
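The shape mapping described in the list above can be checked outside of ggplot. This is only a minimal base R sketch, using the Reads...ME_017 column from the question as sample input: ifelse() is vectorised, so it labels the whole column at once, and the named vector given to scale_shape_manual() is simply a lookup from those labels to shape numbers.

```r
# Values taken from the Reads...ME_017 column in the question ("-" read as NA)
value <- c(41.38, NA, 0, NA, NA, 29.85, 0, NA)

# ifelse() works element-wise, so no loop or sapply() is needed
status <- ifelse(is.na(value), "Missing", "Present")

# Same named vector as in scale_shape_manual(values = ...)
shapes <- c(Missing = 4, Present = 19)
point_shape <- unname(shapes[status])

print(data.frame(value, status, point_shape))
```

This is also why a scalar helper with if() is not needed: the vectorised test covers every row in one call.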
I know you wanted filled points with a black outline (it does look better), but when I tried that with the added shape aesthetic, the fill legend does not display the colors correctly. Try it yourself.
Here is another approach that comes closer to producing the graph you specified (circular points with black outline and fill color determined by coverage).
fill.colors <- hcl(h=seq(15, 375, length=11), l=65, c=100)[1:10]
ggplot(dataset, aes(x=Pos, y=variable,
fill=cut(value, breaks=10),
shape=ifelse(is.na(value),"Missing","Present"))) +
geom_point(size=4) +
geom_line() +
scale_fill_manual(name="Coverage in %",
values=fill.colors,
labels=c("0-10%","10-20%","20-30%","30-40%","40-50%","50-60%","60-70%","70-80%","80-90%","90-100%"),
drop=FALSE) +
scale_shape_manual(name="",values=c(Missing=4,Present=21),limits=c("Missing"))+
xlab("CpG Positions") +
ylab("Sample") +
labs(color="Coverage in %") +
theme_bw() +
theme(axis.text.x = element_text(angle=90, hjust=1, vjust=0.5),
plot.title = element_text(vjust=2),
axis.title.x = element_text(vjust=-0.5),
axis.title.y = element_text(vjust=1.5))+
guides(fill=guide_legend(override.aes=list(colour=fill.colors),order=1))
The problem in the other answer with using point shape 21 and the fill aesthetic is that, while the fill colors are displayed correctly in the plot, they are not displayed correctly in the legend. One way around that is to force ggplot to set the legend fill colors using
guides(fill=guide_legend(override.aes=list(colour=fill.colors),order=1))
Unfortunately, to do that you have to specify the fill colors manually (so that the actual fill and the override fill are the same). This code does that using
fill.colors <- hcl(h=seq(15, 375, length=11), l=65, c=100)[1:10]
which creates a color palette that mimics the ggplot default. You could of course use your own color palette here.
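The palette line can be wrapped in a small function to make the idea easier to reuse. This is a sketch only; hue_palette is a hypothetical helper name, not part of ggplot2. It spaces n + 1 hues evenly between 15 and 375 and drops the last one, since hue is cyclic and 375 is the same hue as 15.

```r
# Hypothetical helper: n evenly spaced hues at fixed luminance and chroma.
# The (n + 1)-th hue is dropped because hue 375 is the same hue as 15.
hue_palette <- function(n) {
  hues <- seq(15, 375, length.out = n + 1)
  hcl(h = hues, l = 65, c = 100)[seq_len(n)]
}

fill.colors <- hue_palette(10)
print(fill.colors)
```

With n = 10 this yields the same vector as the one-liner above, so either form can be passed to scale_fill_manual(values = ...).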
While this does come closer to your original intent, I actually think the other answer provides a better data visualization. The black outlines around the points, while "attractive", make it much more difficult to distinguish between fill colors, especially with 10 possible colors (which is at the edge of discernibility anyway).
I can't see why this is not working:
fill.colors <- hcl(h=seq(15, 375, length=11), l=65, c=100)[1:10]
ggplot(dataset, aes(x=Pos, y=variable
,color=cut(value, breaks=c(-0.01,10,20,30,40,50,60,70,80,90,100))
,shape=ifelse(is.na(value),"Missing","Present"))) +
geom_point(size=4) +
scale_shape_manual(name="",values=c("Missing"=4,"Present"=19),limits=c("Missing"))+
scale_color_manual(name="Coverage in %",
values=ifelse(is.na(dataset$value),"grey",fill.colors),
labels=c("0-10%","10-20%","20-30%","30-40%","40-50%","50-60%","60-70%","70-80%","80-90%","90-100%"),drop=FALSE) +
theme_bw() +
theme(axis.text.x = element_text(angle=90, hjust=1, vjust=0.5),
plot.title = element_text(vjust=2),
axis.title.x = element_text(vjust=-0.5),
axis.title.y = element_text(vjust=1.5)) +
xlab("CpG Positions") +
ylab("Sample") +
labs(color="Coverage in %") +
guides(fill=guide_legend(override.aes=list(colour=fill.colors),order=1))
NA values are no longer shown with an X, and instead of the NA values being drawn in grey, the 90-100% class is drawn in grey. No error message is shown. What is the problem?
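One likely explanation for the grey 90-100% class, offered as a sketch rather than a confirmed diagnosis: ifelse(is.na(dataset$value), "grey", fill.colors) returns one entry per row of the data, not one per class. As far as I can tell, a manual scale with ten classes then uses the first ten entries, and the tenth row of the melted sample data happens to be an NA. The value vector below is typed in from the question's CSV in the order melt() would produce; it does not address the missing X markers.

```r
fill.colors <- hcl(h = seq(15, 375, length.out = 11), l = 65, c = 100)[1:10]

# dataset$value as melt() would order it: ME_016, ME_017, ME_018, ME_019
value <- c(33.82, 0, 0, 0, 0, 28.33, 0, 0,
           41.38, NA, 0, NA, NA, 29.85, 0, NA,
           41.38, NA, NA, 0, NA, 25.64, NA, 50,
           38.46, NA, NA, NA, NA, 28.13, 0, NA)

# ifelse() returns one entry per row, recycling fill.colors as needed
bad_values <- ifelse(is.na(value), "grey", fill.colors)
length(bad_values)   # 32, not 10
bad_values[10]       # "grey", because row 10 is the first NA

# cut() with the explicit breaks still yields exactly 10 classes
classes <- cut(value, breaks = c(-0.01, seq(10, 100, by = 10)))
nlevels(classes)     # 10
```

If that is the cause, passing values=fill.colors unchanged and setting the colour for missing data through the na.value argument of scale_color_manual() may be closer to what was intended.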
I have an empirical PDF + CDF combo I'd like to plot on the same panel. distro.df has columns pdf, cdf, and day. I'd like the pdf values to be plotted as bars, and the cdf as lines. This does the trick for making the plot:
p <- ggplot(distro.df, aes(x=day))
p <- p + geom_bar(aes(y=pdf/max(pdf)), stat="identity", width=0.95, fill=fillCol)
p <- p + geom_line(aes(y=cdf))
p <- p + xlab("Day") + ylab("")
p <- p + theme_bw() + theme(panel.background = element_blank(), panel.border=element_blank())
However, I'm having trouble getting a legend to appear. I'd like a line for the cdf and a filled block for the pdf. I've tried various contortions with guides, but can't seem to get anything to appear.
Suggestions?
EDIT:
Per @Henrik's request: to make a suitable distro.df object:
df <- data.frame(day=0:10)
df$pdf <- runif(length(df$day))
df$pdf <- df$pdf / sum(df$pdf)
df$cdf <- cumsum(df$pdf)
Then the above to make the plot, then invoke p to see the plot.
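For a repeatable version of that object, here is a minimal sketch. The seed is an assumption added only for reproducibility, and the data frame is named distro.df so that it matches the plotting code. It also computes the rescaled bar heights that the geom_bar layer plots.

```r
set.seed(1)  # assumed seed, only so the sketch is repeatable

distro.df <- data.frame(day = 0:10)
distro.df$pdf <- runif(nrow(distro.df))
distro.df$pdf <- distro.df$pdf / sum(distro.df$pdf)  # normalise to sum to 1
distro.df$cdf <- cumsum(distro.df$pdf)               # running total, ends at 1

# Heights used by the bar layer: pdf rescaled so the tallest bar is 1,
# which puts bars and the cdf line on the same 0-1 axis
bar_height <- distro.df$pdf / max(distro.df$pdf)

print(cbind(distro.df, bar_height))
```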
This generally involves moving fill into aes and using it in both the geom_bar and geom_line layers. In this case, you also need to add show_guide = TRUE to geom_line (this argument was renamed show.legend in ggplot2 2.0.0).
Once you have that, you just need to set the fill colors in scale_fill_manual so CDF doesn't have a fill color and use override.aes to do the same thing for the lines. I didn't know what your fill color was, so I just used red.
ggplot(df, aes(x=day)) +
geom_bar(aes(y=pdf/max(pdf), fill = "PDF"), stat="identity", width=0.95) +
geom_line(aes(y=cdf, fill = "CDF"), show_guide = TRUE) +
xlab("Day") + ylab("") +
theme_bw() +
theme(panel.background = element_blank(),
panel.border=element_blank()) +
scale_fill_manual(values = c(NA, "red"),
breaks = c("PDF", "CDF"),
name = element_blank(),
guide = guide_legend(override.aes = list(linetype = c(0,1))))
I'd still like a solution to the above (and will check out @aosmith's answer), but I am currently going with a slightly different approach that avoids the problem altogether:
p <- ggplot(distro.df, aes(x=day, color=pdf, fill=pdf))
p <- p + geom_bar(aes(y=pdf/max(pdf)), stat="identity", width=0.95)
p <- p + geom_line(aes(y=cdf), color="black")
p <- p + xlab("Day") + ylab("CDF")
p <- p + theme_bw() + theme(panel.background = element_blank(), panel.border=element_blank())
p
This also has the advantage of displaying some of the previously missing information, namely the PDF values.
Sample of the dataset.
nq
0.140843018
0.152855833
0.193245919
0.156860105
0.171658019
0.186281942
0.290739146
0.162779517
0.164694042
0.171658019
0.195866609
0.166967913
0.136841748
0.108907644
0.264136384
0.356655651
0.250508305
I would like to make a percentage bar plot/histogram like the one in this question: RE: Alignment of numbers on the individual bars with ggplot2
For the full dataset, the maximum value of nq is 21 and the minimum is 0.00005.
However, I am unable to adapt that code because I have no Freq column and only a single series.
I have made a mockup of the figure I am trying to make.
Could you please help?
Would that work for you?
nq <- read.table(text = "
0.140843018
0.152855833
0.193245919
0.156860105
0.171658019
0.186281942
0.290739146
0.162779517
0.164694042
0.171658019
0.195866609
0.166967913
0.136841748
0.108907644
0.264136384
0.356655651
0.250508305", header = F) # Your data
nq$V2 <- cut(nq$V1, 5, include.lowest = T)
nq2 <- aggregate(V1 ~ V2, nq, length)
nq2$V3 <- nq2$V1/sum(nq2$V1)
library(ggplot2)
ggplot() + geom_bar(data = nq2, aes(V2, V1), stat = "identity", width=1, fill = "white", col = "black", size = 2) +
geom_text(vjust=1, fontface="bold", data = nq2, aes(label = paste(sprintf("%.1f", V3*100), "%", sep=""), x = V2, y = V1 + 0.4), size = 5) +
theme_bw() +
scale_x_discrete(expand = c(0,0), labels = sprintf("%.3f",seq(min(nq$V1), max(nq$V1), by = max(nq$V1)/6))) +
ylab("No. of Cases") + xlab("") +
scale_y_continuous(expand = c(0,0)) +
theme(
axis.title.y = element_text(size = 20, face = "bold", angle = 0),
panel.grid.major = element_blank() ,
panel.grid.minor = element_blank() ,
panel.border = element_blank() ,
panel.background = element_blank(),
axis.line = element_line(color = 'black', size = 2),
axis.text.x = element_text(face="bold"),
axis.text.y = element_text(face="bold")
)
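The binning and percentage steps in the code above can be checked on their own before plotting. This is a minimal sketch using the sample values from the question; it swaps aggregate() for table() and prop.table(), which is my substitution rather than part of the answer, and one difference is that table() keeps bins with zero cases.

```r
nq <- c(0.140843018, 0.152855833, 0.193245919, 0.156860105, 0.171658019,
        0.186281942, 0.290739146, 0.162779517, 0.164694042, 0.171658019,
        0.195866609, 0.166967913, 0.136841748, 0.108907644, 0.264136384,
        0.356655651, 0.250508305)

# Five equal-width bins over the observed range
bins <- cut(nq, 5, include.lowest = TRUE)

counts <- table(bins)                        # cases per bin (the missing "Freq")
pct <- as.vector(100 * prop.table(counts))   # share of all cases, in percent
labels <- sprintf("%.1f%%", pct)             # text to print on each bar

print(data.frame(bin = names(counts), n = as.vector(counts), label = labels))
```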
I thought this would be easy, but it turned out to be frustrating. So perhaps the "right" way is to transform your data before using ggplot as it looks like @DavidArenburg has done. But, if you feel like hacking ggplot, here's what I ended up doing.
First, some sample data.
set.seed(15)
dd<-data.frame(x=sample(1:25, 100, replace=T, prob=25:1))
br <- seq(0,25, by=5) # break points
My first attempt was
library(ggplot2)
ggplot(dd, aes(x)) +
stat_bin(position="stack", breaks=br) +
geom_text(aes(y=..count.., label=..density..*..width.., ymax=..count..+1),
vjust=-.5, breaks=br, stat="bin")
but that didn't make "pretty" labels,
so I thought I'd use the percent() function from the scales package to make them pretty. However, ggplot doesn't really make it possible to use functions with ..().. variables, because it evaluates them in the data.frame only (and then in baseenv(), whose parent is the empty environment), so it has no way to find the function you use. This is when I turned to hacking. First I'll extract the "Layer" definition from ggplot and the map_statistic function from it. (NOTE: this was done with "ggplot2_1.0.0" and is specific to that version; this is a private function that may change in future releases.)
orig.map_statistic <- ggplot2:::Layer$map_statistic
new.map_statistic <- orig.map_statistic
body(new.map_statistic)[[9]]
# stat_data <- as.data.frame(lapply(new, eval, data, baseenv()))
Here's the line that's causing grief. I would prefer it if the function resolved other names in the plot environment that are not found in the data.frame. So I decided to change it with
body(new.map_statistic)[[9]] <- quote(stat_data <- as.data.frame(lapply(new, eval, data, plot$plot_env)))
assign("map_statistic", new.map_statistic, envir=ggplot2:::Layer)
So now I can use functions with ..().. variables. So I can do
library(scales)
ggplot(dd, aes(x)) +
stat_bin(position="stack", breaks=br) +
geom_text(aes(y=..count.., ymax=..count..+2,
label=percent(..density..*..width..)),
vjust=-.5, breaks=br, stat="bin")
to get percent-formatted labels above the bars.
I'm not sure why ggplot has this default behavior. There could be some good reason for it, but I don't know what it is. Note that this changes how ggplot behaves for the rest of the session. You can change back to the default with
assign("map_statistic", orig.map_statistic, envir=ggplot2:::Layer)
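The lookup problem described above can be reproduced in base R without touching ggplot internals. This is a sketch under stated assumptions: fmt_pct is a hypothetical stand-in for scales::percent(), and stats is a made-up data frame standing in for the computed statistics. With baseenv() as the enclosure the user-defined function is not found; with the calling environment it is.

```r
# Hypothetical stand-in for scales::percent()
fmt_pct <- function(p) sprintf("%.0f%%", 100 * p)

# Made-up "computed statistics" for two bins
stats <- data.frame(density = c(0.1, 0.06), width = c(5, 5))
expr <- quote(fmt_pct(density * width))

# Data frame first, then baseenv(): fmt_pct is in neither, so this fails
in_base <- tryCatch(eval(expr, stats, baseenv()),
                    error = function(e) "not found")

# Data frame first, then the current environment: fmt_pct is found
in_here <- eval(expr, stats, environment())

print(in_base)
print(in_here)
```

Swapping the enclosure is the same kind of change the patched map_statistic makes by evaluating in plot$plot_env instead of baseenv().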
Suppose I have a data set called "data", generated as follows:
library(reshape2) # Reshape data, needed in command "melt"
library(ggplot2) # apply ggplot
density <-rep (0.05, each=800)
tau <-rep (0.05, each=800)
# define two different models: network and non-network
model <-rep(1:2, each=400, times=1)
## Create data and factors for the plot
df <- melt(rnorm(800, -3, 0.5))
data <- as.data.frame(cbind(density, tau, model, df$value))
data$density <- factor(data$density,levels=0.05,
labels=c("Density=0.05"))
data$tau <- factor(data$tau,levels=0.05,
labels=c("tau=0.05"))
data$model<- factor(data$model,levels=c(1,2),
labels=c("Yes",
"No"))
ggplot(data=data, aes(x=V4, shape=model, colour=model, lty=model)) +
stat_density(adjust=1, geom="line",position="identity") +
facet_grid(tau~density, scale="free") +
geom_vline(xintercept=-3, lty="dashed") +
ggtitle("Kernel Density") +
xlab("Data") +
ylab("Kernel Density") +
theme(plot.title=element_text(face="bold", size=17), # font size of the title
axis.text.x= element_text(size=14),
axis.text.y= element_text(size=14),
legend.title=element_text(size=14),
legend.text =element_text(size=12),
strip.text.x=element_text(size=14), # font size of the column facet labels
strip.text.y=element_text(size=14)) # font size of the row facet labels
Looking at the data, variable V4 is split into two subsets by model (Yes [1:400] and No [401:800]), and the kernel density is plotted without changing the default bandwidth, since adjust=1.
What I want is this: for the Yes model, the bandwidth should be 10 times the original, while for the No model it should stay unchanged. Can I do something like adjust=c(10, 1)? I know how to do this with plot()+lines(), but I want to do it in ggplot() for further analysis.
I wouldn't recommend this, since it creates a very misleading plot, but you can do it with two calls to stat_density(...).
ggplot(data=data, aes(x=V4, shape=model, colour=model, lty=model)) +
stat_density(data=data[data$model=="Yes",], adjust=10,
geom="line",position="identity") +
stat_density(data=data[data$model=="No",], adjust=1,
geom="line",position="identity") +
facet_grid(tau~density, scale="free") +
geom_vline(xintercept=-3, lty="dashed") +
ggtitle("Kernel Density") +
xlab("Data") +
ylab("Kernel Density") +
theme(plot.title=element_text(face="bold", size=17),
axis.text.x= element_text(size=14),
axis.text.y= element_text(size=14),
legend.title=element_text(size=14),
legend.text =element_text(size=12),
strip.text.x=element_text(size=14),
strip.text.y=element_text(size=14))
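To see what adjust=10 does to one group, and why the combined plot can mislead, the two bandwidths can be compared with base R's density(). This is a minimal sketch, assuming a fixed seed and a single simulated sample of the same shape as the "Yes" subset; the adjust argument multiplies the default bandwidth, so the wide estimate is much flatter than the default one for the same data.

```r
set.seed(42)  # assumed seed, only so the sketch is repeatable
x <- rnorm(400, mean = -3, sd = 0.5)

d_default <- density(x, adjust = 1)   # default bandwidth
d_wide    <- density(x, adjust = 10)  # bandwidth multiplied by 10

# Ratio of the two bandwidths and the resulting peak heights
bw_ratio <- d_wide$bw / d_default$bw
peaks <- c(default = max(d_default$y), wide = max(d_wide$y))

print(bw_ratio)
print(peaks)
```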