ggplot2 legend for abline and stat_smooth - r

I have some problems with ggplot legends, here is my first code with only the legend for corrGenes, which is fine.
gene1=c(1.041,0.699,0.602,0.602,2.585,0.602,1.000,0.602,1.230,1.176,0.699,0.477,1.322)
BIME = c(0.477,0.477,0.301,0.477,2.398,0.301,0.602,0.301,0.602,0.699,0.602,0.477,1.176)
corrGenes=c(0.922,0.982,0.934,0.917,0.993,0.697,0.000,0.440,0.859,0.788,0.912,0.687,0.894)
DF=data.frame(gene1,BIME,corrGenes)
plot= ggplot(data=DF,aes(x=gene1,y=BIME))+
geom_point(aes(colour=corrGenes),size=5)+
ylab("BIME normalized counts (log10(RPKM))")+
xlab("gene1 normalized counts (log10(RPKM))")
When I add abline and smooth, I get the correct plot with :
plot= ggplot(data=DF,aes(x=gene1,y=BIME))+
geom_point(aes(colour=corrGenes),size=5)+
geom_abline(intercept=0, slope=1)+
stat_smooth(method = "lm",se=FALSE)+
ylab("BIME normalized counts (log10(RPKM))")+
xlab("gene1 normalized counts (log10(RPKM))")
but no way to get the legend for them, I tried and many other combinations:
plot= ggplot(data=DF,aes(x=gene1,y=BIME))+
geom_point(aes(colour=corrGenes),size=5)+
geom_abline(aes(colour="best"),intercept=0, slope=1)+
stat_smooth(aes(colour="data"),method = "lm",se=FALSE)+
scale_colour_manual(name="Fit", values=c("data"="blue", "best"="black"))+
ylab("BIME normalized counts (log10(RPKM))")+
xlab("gene1 normalized counts (log10(RPKM))")
If anyone has an idea to solve this tiny but very annoying problem, it would be very helpfull!

Finally, I found anther way using a trick. First, I've computed the linear regression and convert the results to a data frame which I add my best fit (Intercept = 0 and slope =1), then I added a column for type of data (data or best).
modele = lm(BIME ~ gene1, data=DF)
coefs = data.frame(intercept=coef(modele)[1],slope=coef(modele)[2])
coefs= rbind(coefs,list(0,1))
regression=as.factor(c('data','best'))
coefs=cbind(coefs,regression)
then I plotted it with a unique geom_abline command and moving the DF from ggplot() to geom_point() and used the linetype parameter to differenciate the two lines :
plot = ggplot()+
geom_point(data=pointSameStrandDF,aes(x=gene1,y=BIME,colour=corrGenes),size=5)+
geom_abline(data=coefs, aes(intercept=intercept,slope=slope,linetype=regression), show_guide=TRUE)+
ylab("BIME normalized counts (log10(RPKM))")+
xlab("gene1 normalized counts (log10(RPKM))")
There is maybe a way to use colors for those 2 lines, but I can't find out how?
Thanks for your help guys!

The show_guide=TRUE argument should display the legends for both geom_abline and stat_smooth. Try running the below code.
plot= ggplot(data=DF,aes(x=gene1,y=BIME))+
geom_point(aes(colour=corrGenes),size=5)+
geom_abline(aes(colour="best"),intercept=0, slope=1, show_guide=TRUE)+
stat_smooth(aes(colour="data"),method = "lm",se=FALSE, show_guide=TRUE)+
scale_colour_manual(name="Fit", values=c("data"="blue", "best"="black"))+
ylab("BIME normalized counts (log10(RPKM))")+
xlab("gene1 normalized counts (log10(RPKM))")

Not sure if this is the best solution, but I was able to tell ggplot to have two scales, one for the colours (your points), the other one for the fill colour. Which fill colour you are probably asking? The one I added in the aes for the two lines:
plot = ggplot(data=DF,aes(x=gene1,y=BIME)) +
geom_point(size=5, aes(colour=corrGenes)) +
geom_abline(aes(fill="black"),intercept=0, slope=1) +
stat_smooth(aes(fill="blue"), method = "lm",se=FALSE) +
scale_fill_manual(name='My Lines', values=c("black", "blue"))+
ylab("BIME normalized counts (log10(RPKM))")+
xlab("gene1 normalized counts (log10(RPKM))")

Related

Extend line length with geom_line

I want to represent three lines on a graph overlain with datapoints that I used in a discriminant function analysis. From my analysis, I have two points that fall on each line and I want to represent these three lines. The lines represent the probability contours of the classification scheme and exactly how I got the points on the line are not relevant to my question here. However, I want the lines to extend further than the points that define them.
df <-
data.frame(Prob = rep(c("5", "50", "95"), each=2),
Wing = rep(c(107,116), 3),
Bill = c(36.92055, 36.12167, 31.66012, 30.86124, 26.39968, 25.6008))
ggplot()+
geom_line(data=df, aes(x=Bill, y=Wing, group=Prob, color=Prob))
The above df is a dataframe for my points from which the three lines are constructed. I want the lines to extend from y=105 to y=125.
Thanks!
There are probably more idiomatic ways of doing it but this is one way to get it done.
In short you quickly calculate the linear formula that will connect the lines i.e y = mx+c
df_withFormula <- df |>
group_by(Prob) |>
#This mutate command will create the needed slope and intercept for the geom_abline command in the plotting stage.
mutate(increaseBill = Bill - lag(Bill),
increaseWing = Wing - lag(Wing),
slope = increaseWing/increaseBill,
intercept = Wing - slope*Bill)
# The increaseBill, increaseWing and slope could all be combined into one calculation but I thought it was easier to understand this way.
ggplot(df_withFormula, aes(Bill, Wing, color = Prob)) +
#Add in this just so it has something to plot ontop of. You could remove this and instead manually define all the limits (expand_limits would work).
geom_point() +
#This plots the three lines. The rows with NA are automatically ignored. More explicit handling of the NA could be done in the data prep stage
geom_abline(aes(slope = slope, intercept = intercept, color = Prob)) +
#This is the crucial part it lets you define what the range is for the plot window. As ablines are infite you can define whatever limits you want.
expand_limits(y = c(105,125))
Hope this helps you get the graph you want.
This is very much dependent on the structure of your data it could though be changed to fit different shapes.
Similar to the approach by #James in that I compute the slopes and the intercepts from the given data and use a geom_abline to plot the lines but uses
summarise instead of mutate to get rid of the NA values
and a geom_blank instead of a geom_point so that only the lines are displayed but not the points (Note: Having another geom is crucial to set the scale or the range of the data and for the lines to show up).
library(dplyr)
library(ggplot2)
df_line <- df |>
group_by(Prob) |>
summarise(slope = diff(Wing) / diff(Bill),
intercept = first(Wing) - slope * first(Bill))
ggplot(df, aes(x = Bill, y = Wing)) +
geom_blank() +
geom_abline(data = df_line, aes(slope = slope, intercept = intercept, color = Prob)) +
scale_y_continuous(limits = c(105, 125))

Probability density Matrix Subtraction for heatmaps

Pretty new to R and stuck. I am attempting to normalize 2d probability density of a heat map by subtracting the 2d probability densities of another data set. I am looking where behaviors occur in space, however to do this I want to subtract out where the subjects just spend most of their time from were the behaviors are occuring to get an idea of relative density of just the behaviors. To do this I am trying to find the probability density matrices used to plot a heatmap for the following code:
ctrlplot<-ctrl %>% ggplot(aes(x=x, y=y)) +
stat_density_2d(geom = "raster", aes(fill = stat(density)), contour = FALSE)+
scale_fill_gradientn(colours=matlab.like(15), na.value = "gray",
as lowertick, uppertick, interval
limit=c(0,1.3e-05)) #sets the static limit of probabilities.
This works to make the heat plot for either data set plot, however I cannot find where ggplot or stat_density_2d is storing the density data to subtract the two.
Alternatively I have tried to get just the densities for both data sets using the following code and storing it as the variable dens:
n<-100
h<-c(bandwidth.nrd(ctrl$x),bandwidth.nrd(ctrl$y))
dens<-kde2d(ctrl$x,ctrl$y,n=n,h=h)
Now I am not sure how to subtract the resulting z values and get it back into a heat plot. I know there is likely an easy solution for this, but I am definitely stuck. Any advice on how to do this easier, or other suggestions on how to subtract the densities from one another would be greatly appreciated.
UPDATE:
I found a way to pull the density data from ggplot. I was able to pull the density data from two different data sets, subtract the vectors and place the densities back into the original data frame using the following code:
ctrlplot<-ctrl %>% ggplot(aes(x=x, y=y)) +
stat_density_2d(geom = "raster", aes(fill = stat(density)), contour = FALSE)+
scale_fill_gradientn(colours=matlab.like(15), na.value = "gray")
ctxplot<-ctx %>% ggplot(aes(x=x, y=y)) +
stat_density_2d(geom = "raster", aes(fill = stat(density)), contour = FALSE)+
scale_fill_gradientn(colours=matlab.like(15), na.value = "gray")
ctrlplot2<-ggplot_build(ctrlplot)
gbctrl<-ctrlplot2$data[[1]]
densctrl<-gbctrl$density
gbctx<-ggplot_build(ctxplot)
gbctx<-gbctx$data[[1]]
densctx<-gbctx$density
diff_ctrl_ctx<-densctrl-densctx
gbctrl$density<-diff_ctrl_ctx
ctrlplot2$data[[1]]<-gbctrl
ctrlplot2
ctrlplot
However the last two plots ctrlplot (original) and ctrlplot2(subtracted densities) give the same plot. Not sure if I am not replacing the correct parts of the data frame so that it updates for the graphing part since there are different lists in the original ggplot_build.

ggplot geom_histogram color by factor not working properly

In trying to color my stacked histogram according to a factor column; all the bars have a "green" roof? I want the bar-top to be the same color as the bar itself. The figure below shows clearly what is wrong. All the bars have a "green" horizontal line at the top?
Here is a dummy data set :
BodyLength <- rnorm(100, mean = 50, sd = 3)
vector <- c("80","10","5","5")
colors <- c("black","blue","red","green")
color <- rep(colors,vector)
data <- data.frame(BodyLength,color)
And the program I used to generate the plot below :
plot <- ggplot(data = data, aes(x=data$BodyLength, color = factor(data$color), fill=I("transparent")))
plot <- plot + geom_histogram()
plot <- plot + scale_colour_manual(values = c("Black","blue","red","green"))
Also, since the data column itself contains color names, any way I don't have to specify them again in scale_color_manual? Can ggplot identify them from the data itself? But I would really like help with the first problem right now...Thanks.
Here is a quick way to get your colors to scale_colour_manual without writing out a vector:
data <- data.frame(BodyLength,color)
data$color<- factor(data$color)
and then later,
scale_colour_manual(values = levels(data$color))
Now, with respect to your first problem, I don't know exactly why your bars have green roofs. However, you may want to look at some different options for the position argument in geom_histogram, such as
plot + geom_histogram(position="identity")
..or position="dodge". The identity option is closer to what you want but since green is the last line drawn, it overwrites previous the colors.
I like density plots better for these problems myself.
ggplot(data=data, aes(x=BodyLength, color=color)) + geom_density()
ggplot(data=data, aes(x=BodyLength, fill=color)) + geom_density(alpha=.3)

Overlay ggplot2 stat_density2d plots with alpha channels constant across groups

I would like to plot multiple groups in a stat_density2 plot with alpha values related to the counts of observations in each group. However, the levels formed by stat_density2d seem to be normalized to the number of observations in each group. For example,
temp <- rbind(movies[1:2,],movies[movies$mpaa == "R" | movies$mpaa == "PG-13",])
ggplot(temp, aes(x=rating,y=length)) +
stat_density2d(geom="tile", aes(fill = mpaa, alpha=..density..), contour=FALSE) +
theme_minimal()
Creates a plot like this:
Because I only included 2 points without ratings, they result in densities that look much tighter/stronger than the other two, and so wash out the other two densities. I've tried looking at Overlay two ggplot2 stat_density2d plots with alpha channels and Specifying the scale for the density in ggplot2's stat_density2d but they don't really address this specific issue.
Ultimately, what I'm trying to accomplish with my real data, is I have "power" samples from discrete 2d locations for multiple conditions, and I am trying to plot what their relative powers/spatial distributions are. I am duplicating points in locations relative to their powers, but this has resulted in low power conditions with just a few locations looking the strongest when using stat_density2d. Please let me know if there is a better way of going about doing this!
Thanks!
stat_hexbin, which understands ..count.. in addition to ..density.., may work for you:
ggplot(temp, aes(x=rating,y=length)) +
stat_binhex(geom="hex", aes(fill = mpaa, alpha=..count..)) +
theme_minimal()
Although you may want to adjust the bin width.
Not the most elegant r code, but this seems to work. I normalize my real data a bit differently than this, but this gets the solution I found across. I use a for loop where I find the average power for the condition and add a new stat_density2d layer with the alpha scaled by that average power.
temp <- rbind(movies[1:2,],movies[movies$mpaa == "R" | movies$mpaa == "PG-13",])
mpaa = unique(temp$mpaa)
p <- ggplot() + theme_minimal()
for (ii in seq(1,3)) {
ratio = length(which(temp$mpaa == mpaa[ii]))
p <- p + stat_density2d(data=temp[temp$mpaa == mpaa[ii],],
aes(x=rating,y=length,fill = mpaa, alpha=..level..),
geom="polygon",
contour=TRUE,
alpha = ratio/20,
lineType = "none")
}
print(p)

R - Time series data with ggplot2

I have a time series dataset in which the x-axis is a list of events in reverse chronological order such that an observation will have an x value that looks like "n-1" or "n-2" all the way down to 1.
I'd like to make a line graph using ggplot that creates a smooth, continuous line that connects all of the points, but it seems when I try to input my data, the x-axis is extremely wonky.
The code I am currently using is
library(ggplot2)
theoretical = data.frame(PA = c("n-1", "n-2", "n-3"),
predictive_value = c(100, 99, 98));
p = ggplot(data=theoretical, aes(x=PA, y=predictive_value)) + geom_line();
p = p + scale_x_discrete(labels=paste("n-", 1:3, sep=""));
The fitted line and grid partitions that would normally appear using ggplot are replaced by no line and wayyy too many partitions.
When you use geom_line() with a factor on at least one axis, you need to specify a group aesthetic, in this case a constant.
p = ggplot(data=theoretical, aes(x=PA, y=predictive_value, group = 1)) + geom_line()
p = p + scale_x_discrete(labels=paste("n-", 1:3, sep=""))
p
If you want to get rid of the minor grid lines you can add
theme(panel.grid.minor = element_blank())
to your graph.
Note that it can be a little risky, scale-wise, to use factors on one axis like this. It may work better to use a typical continuous scale, and just relabel the points 1, 2, and 3 with "n-1", "n-2", and "n-3".

Resources