How to add multiple horizontal ribbons (like geom_hlines) in ggplot2?

I would like to plot some horizontal lines onto a scatterplot (e.g. with geom_hline) and then put some error ribbons around those lines that have different widths for each line.
I have a data frame consisting of a continuous x and y and grouping factor:
#make the dataframe:
so<-data.frame(expand.grid(x=c(1:5),sys=c("a","b","c","d")))
so$y<-c(1,2,1,3,2,2,1,3,2,3,4,3,2,3,4,5,4,3,4,5)
And a second dataframe with information for some hlines and error ribbons that I would like to add to the plot:
#make the second dataframe:
so2<-data.frame(sys=c("a","b","c","d"),yint=c(1.4,2.3,3.5,4.6),low=c(1.2,2.1,3.4,4.1),
upp=c(1.6,2.7,3.6,4.7))
I can create a plot with the hlines:
ggplot(so,aes(x=x,y=y,colour=sys)) +
geom_point(position=position_jitter()) +
geom_hline(data=so2,aes(yintercept=yint,colour=sys))
But if I try to put the ribbons around them, the ggplot gets lost without x values:
ggplot(so,aes(x=x,y=y,colour=sys)) +
geom_point(position=position_jitter()) +
geom_hline(data=so2,aes(yintercept=yint,colour=sys))+
geom_ribbon(data=so2,aes(ymin=low,ymax=upp))
#Error in FUN(X[[i]], ...) : object 'x' not found
Is it possible to get geom_ribbon to act like geom_hline? Or is there a workaround of e.g. plotting the upper and lower bounds as hlines and somehow shading between them?

I'm not sure I understand what you're trying to achieve, but if you use geom_rect() instead of geom_ribbon() you can indicate the upper/lower bounds, e.g.
library(tidyverse)
so<-data.frame(expand.grid(x=c(1:5),sys=c("a","b","c","d")))
so$y<-c(1,2,1,3,2,2,1,3,2,3,4,3,2,3,4,5,4,3,4,5)
#make the second dataframe:
so2<-data.frame(sys=c("a","b","c","d"),yint=c(1.4,2.3,3.5,4.6),low=c(1.2,2.1,3.4,4.1),
upp=c(1.6,2.7,3.6,4.7))
ggplot(so,aes(x=x,y=y,colour=sys)) +
geom_point(position=position_jitter()) +
geom_hline(data=so2,aes(yintercept=yint, colour=sys)) +
geom_rect(data = so2, aes(ymin = low, ymax = upp,
xmin = 0.5, xmax = 5.5, fill=sys),
alpha = 0.2, inherit.aes = FALSE)
The issue with geom_ribbon() is that it requires an x aesthetic, whereas you have a single pair of upper/lower bounds per group that applies to all values of x. I don't know how to make that work with geom_ribbon() directly, unless your actual data differs from this minimal reproducible example. Hopefully this helps and makes sense.
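If geom_ribbon() is still preferred, one possible workaround is to give each row of so2 explicit x values spanning the plot. The following is only a sketch under assumptions: the data frame name ribbon_df and the x limits 0.5 and 5.5 (borrowed from the geom_rect() call) are illustrative choices, not part of the original question.

```r
library(ggplot2)

# Data from the question
so <- expand.grid(x = 1:5, sys = c("a", "b", "c", "d"))
so$y <- c(1, 2, 1, 3, 2, 2, 1, 3, 2, 3, 4, 3, 2, 3, 4, 5, 4, 3, 4, 5)
so2 <- data.frame(sys  = c("a", "b", "c", "d"),
                  yint = c(1.4, 2.3, 3.5, 4.6),
                  low  = c(1.2, 2.1, 3.4, 4.1),
                  upp  = c(1.6, 2.7, 3.6, 4.7))

# Cross join: repeat every row of so2 for each x limit (assumed limits),
# so that geom_ribbon() has the x aesthetic it requires.
ribbon_df <- merge(so2, data.frame(x = c(0.5, 5.5)), by = NULL)

p <- ggplot(so, aes(x = x, y = y, colour = sys)) +
  geom_point(position = position_jitter()) +
  geom_hline(data = so2, aes(yintercept = yint, colour = sys)) +
  geom_ribbon(data = ribbon_df,
              aes(x = x, ymin = low, ymax = upp, fill = sys, group = sys),
              alpha = 0.2, inherit.aes = FALSE)
p
```

The ribbon only covers the x range you supply, so unlike geom_hline() it will not automatically stretch to the panel edges.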

Related

position_dodgev causes error in order of connecting points in geom_line

I want to plot over a timecourse x with y values that are often repeated (integer scores 1-4) and I want to visualize many subjects at once.
Because there is so much overlap, a vertical position dodge would be ideal, such as position_dodgev from ggstance package. However, when I try to connect the dots with geom_line, the order of the connection gets messed up and is connected based on y values and not x values.
I tried a coordinate flip work-around which was not successful. And replacing geom_line with geom_path (making sure it was ordered on the x scale) also did not work.
Here is a reproducible example:
#data
df<-tibble(x=c(1,2,3,4,5,1,2,3,4,5,1,2,3,4,5),
y=c(1,2,3,7,7,1,2,3,7,7,2,1,6,7,7),
group=c('a','a','a','a','a','b','b','b','b','b','c','c','c','c','c'))
#horizontal dodge masks groups
ggplot(df, aes(x=x, y=y,col=group,group=group)) +
geom_point(position=position_dodge(width=0.3))+
geom_line(position=position_dodge(width=0.3))
#line connection error with vertical dodge
library(ggstance)
ggplot(df, aes(x=x, y=y,col=group,group=group)) +
geom_point(position=position_dodgev(height=0.3))+
geom_line(position=position_dodgev(height=0.3))
Horizontal dodge works as expected but does not allow visualization of all the overlapped groups in a stretch of repeated y values. Vertical dodge from ggstance connected the dots in group c in the wrong order.
I am not sure what exactly causes the issue. Knowing that position_dodge is not intended to be used with geoms and it's been called a bug, I am surprised and not at the same time about this issue.
But in any case, I found a workaround by disassembling the plot using ggplot_build, rearranging the points for geom_line within that object and then reassembling the plot again; look below:
g <- ggplot(df, aes(x=x, y=y,col=group,group=group)) +
geom_point(position=position_dodgev(height=0.3)) +
geom_line(position=position_dodgev(height=0.3))
gg <- ggplot_build(g)
# -- look at gg$data to understand following lines --
#gg$data[[2]]: data associated with geom_line as it is the 2nd geom
#c(1,2) & c(2,1): I have $group==3 ...
# ... so just need to flip 1st and 2nd datapoints within that group
gg$data[[2]][gg$data[[2]]$group==3,][c(1,2),] <-
gg$data[[2]][gg$data[[2]]$group==3,][c(2,1),]
gt <- ggplot_gtable(gg)
plot(gt)
I suspect the problem occurs due to PositionDodgev's compute_panel function, which takes in a dataset sorted by x values, & returns a dataset sorted instead by y values (within each group) after making the necessary transformations to dodge positions vertically.
The following workaround defines a new ggproto object that inherits from PositionDodgev, but reorders the dataset in compute_panel before returning it:
# new ggproto based on PositionDodgev
PositionDodgeNew <- ggproto(
"PositionDodgeNew",
PositionDodgev,
compute_panel = function (data, params, scales){
result <- ggstance:::collidev(data, params$height,
name = "position_dodgev",
strategy = ggstance:::pos_dodgev,
n = params$n,
check.height = FALSE)
result <- result[order(result$group, result$x), ] # reordering by group & x
result
})
# position function that uses PositionDodgeNew instead of PositionDodgev
position_dodgenew <- function (height = NULL, preserve = c("total", "single")) {
ggproto(NULL, PositionDodgeNew, height = height, preserve = match.arg(preserve))
}
Usage:
po <- position_dodgenew(height = 0.3)
ggplot(df,
aes(x = x, y = y, col = group)) +
geom_point(position = po) +
geom_line(position = po)
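If relying on ggstance internals (accessed with :::) is a concern, a different technique is to skip the position adjustment and compute the vertical offset in the data before plotting. This is a minimal sketch, not the ggproto fix above: the column name y_dodged and the step of 0.1 are assumptions chosen for illustration.

```r
library(ggplot2)

# Same values as the question's example, as a plain data.frame
df <- data.frame(x = rep(1:5, times = 3),
                 y = c(1, 2, 3, 7, 7, 1, 2, 3, 7, 7, 2, 1, 6, 7, 7),
                 group = rep(c("a", "b", "c"), each = 5))

step <- 0.1  # assumed vertical spacing between neighbouring groups

# Centre the group indices around zero so the middle group is not shifted
group_index <- as.integer(factor(df$group))
df$y_dodged <- df$y + (group_index - mean(unique(group_index))) * step

# geom_line() orders by x within each group, so the lines connect correctly
p <- ggplot(df, aes(x = x, y = y_dodged, colour = group, group = group)) +
  geom_point() +
  geom_line()
p
```

Because the offset is applied to the data itself, the y axis shows the shifted values; with a small step relative to the integer scores this is usually acceptable.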

ggplot geom_histogram color by factor not working properly

When I try to color my stacked histogram according to a factor column, all the bars have a "green" roof. I want the top of each bar to be the same color as the bar itself. The figure below shows clearly what is wrong: all the bars have a "green" horizontal line at the top.
Here is a dummy data set :
BodyLength <- rnorm(100, mean = 50, sd = 3)
vector <- c(80, 10, 5, 5) # repeat counts for rep() should be numeric
colors <- c("black","blue","red","green")
color <- rep(colors,vector)
data <- data.frame(BodyLength,color)
And the program I used to generate the plot below :
plot <- ggplot(data = data, aes(x=data$BodyLength, color = factor(data$color), fill=I("transparent")))
plot <- plot + geom_histogram()
plot <- plot + scale_colour_manual(values = c("Black","blue","red","green"))
Also, since the data column itself contains color names, any way I don't have to specify them again in scale_color_manual? Can ggplot identify them from the data itself? But I would really like help with the first problem right now...Thanks.
Here is a quick way to get your colors to scale_colour_manual without writing out a vector:
data <- data.frame(BodyLength,color)
data$color<- factor(data$color)
and then later,
scale_colour_manual(values = levels(data$color))
Now, with respect to your first problem, I don't know exactly why your bars have green roofs. However, you may want to look at some different options for the position argument in geom_histogram, such as
plot + geom_histogram(position="identity")
or position="dodge". The identity option is closer to what you want, but since green is the last layer drawn, it overwrites the previous colors.
I like density plots better for these problems myself.
ggplot(data=data, aes(x=BodyLength, color=color)) + geom_density()
ggplot(data=data, aes(x=BodyLength, fill=color)) + geom_density(alpha=.3)
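On the second part of the question (using the color names stored in the data directly), ggplot2 provides identity scales for this purpose. The sketch below assumes the column holds valid R color names; the seed and the bins = 30 setting are arbitrary choices for the example.

```r
library(ggplot2)

set.seed(42)  # arbitrary seed so the example is reproducible
BodyLength <- rnorm(100, mean = 50, sd = 3)
colors <- c("black", "blue", "red", "green")
data <- data.frame(BodyLength,
                   color = rep(colors, times = c(80, 10, 5, 5)))

# scale_colour_identity() uses the values in `color` as the colors themselves,
# so no manual vector of colors is needed.
p <- ggplot(data, aes(x = BodyLength, colour = color)) +
  geom_histogram(fill = NA, position = "identity", bins = 30) +
  scale_colour_identity()
p
```

Identity scales do not draw a legend by default, which may or may not be what you want here.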

Dot Priority in ggplot2 jittered scatterplot [duplicate]

I'm plotting a dense scatter plot in ggplot2 where each point might be labeled by a different color:
df <- data.frame(x=rnorm(500))
df$y = rnorm(500)*0.1 + df$x
df$label <- c("a")
df$label[50] <- "point"
df$size <- 2
ggplot(df) + geom_point(aes(x=x, y=y, color=label, size=size))
When I do this, the scatter point labeled "point" (green) is plotted on top of the red points which have the label "a". What controls this z ordering in ggplot, i.e. what controls which point is on top of which?
For example, what if I wanted all the "a" points to be on top of all the points labeled "point" (meaning they would sometimes partially or fully hide that point)? Does this depend on alphanumerical ordering of labels?
I'd like to find a solution that can be translated easily to rpy2.
2016 Update:
The order aesthetic has been deprecated, so at this point the easiest approach is to sort the data.frame so that the green point is at the bottom, and is plotted last. If you don't want to alter the original data.frame, you can sort it during the ggplot call - here's an example that uses %>% and arrange from the dplyr package to do the on-the-fly sorting:
library(dplyr)
ggplot(df %>%
arrange(label),
aes(x = x, y = y, color = label, size = size)) +
geom_point()
Original 2015 answer for ggplot2 versions < 2.0.0
In ggplot2, you can use the order aesthetic to specify the order in which points are plotted. The last ones plotted will appear on top. To apply this, you can create a variable holding the order in which you'd like points to be drawn.
To put the green dot on top by plotting it after the others:
df$order <- ifelse(df$label=="a", 1, 2)
ggplot(df) + geom_point(aes(x=x, y=y, color=label, size=size, order=order))
Or to plot the green dot first and bury it, plot the points in the opposite order:
ggplot(df) + geom_point(aes(x=x, y=y, color=label, size=size, order=-order))
For this simple example, you can skip creating a new sorting variable and just coerce the label variable to a factor and then a numeric:
ggplot(df) +
geom_point(aes(x=x, y=y, color=label, size=size, order=as.numeric(factor(df$label))))
ggplot2 will create plots layer-by-layer and within each layer, the plotting order is defined by the geom type. The default is to plot in the order that they appear in the data.
Where this is different, it is noted. For example
geom_line
Connect observations, ordered by x value.
and
geom_path
Connect observations in data order
There are also known issues regarding the ordering of factors, and it is interesting to note the response of the package author Hadley
The display of a plot should be invariant to the order of the data frame - anything else is a bug.
With this quote in mind, note that a layer is drawn in the specified order, so overplotting can be an issue, especially when creating dense scatter plots. So if you want a consistent plot (and not one that relies on the order in the data frame) you need to think a bit more.
Create a second layer
If you want certain values to appear above other values, you can use the subset argument to create a second layer to definitely be drawn afterwards. You will need to explicitly load the plyr package so .() will work.
set.seed(1234)
df <- data.frame(x=rnorm(500))
df$y = rnorm(500)*0.1 + df$x
df$label <- c("a")
df$label[50] <- "point"
df$size <- 2
library(plyr)
ggplot(df) + geom_point(aes(x = x, y = y, color = label, size = size)) +
geom_point(aes(x = x, y = y, color = label, size = size),
subset = .(label == 'point'))
Update
In ggplot2_2.0.0, the subset argument is deprecated. Use e.g. base::subset to select relevant data specified in the data argument. And no need to load plyr:
ggplot(df) +
geom_point(aes(x = x, y = y, color = label, size = size)) +
geom_point(data = subset(df, label == 'point'),
aes(x = x, y = y, color = label, size = size))
Or use alpha
Another approach to avoid the problem of overplotting is to set the alpha (transparency) of the points. This will not be as effective as the explicit second-layer approach above; however, with judicious use of scale_alpha_manual you should be able to get something to work.
For example:
# set alpha = 1 (no transparency) for your point(s) of interest
# and a low value otherwise
ggplot(df) + geom_point(aes(x=x, y=y, color=label, size=size,alpha = label)) +
scale_alpha_manual(guide='none', values = c(a = 0.2, point = 1))
The fundamental question here can be rephrased like this:
How do I control the layers of my plot?
In the 'ggplot2' package, you can do this quickly by splitting each different layer into a different command. Thinking in terms of layers takes a little bit of practice, but it essentially comes down to what you want plotted on top of other things. You build from the background upwards.
Prep: Prepare the sample data. This step is only necessary for this example, because we don't have real data to work with.
# Establish random seed to make data reproducible.
set.seed(1)
# Generate sample data.
df <- data.frame(x=rnorm(500))
df$y = rnorm(500)*0.1 + df$x
# Initialize 'label' and 'size' default values.
df$label <- "a"
df$size <- 2
# Label and size our "special" point.
df$label[50] <- "point"
df$size[50] <- 4
You may notice that I've added a different size to the example just to make the layer difference clearer.
Step 1: Separate your data into layers. Always do this BEFORE you use the 'ggplot' function. Too many people get stuck by trying to do data manipulation from within the 'ggplot' functions. Here, we want to create two layers: one with the "a" labels and one with the "point" labels.
df_layer_1 <- df[df$label=="a",]
df_layer_2 <- df[df$label=="point",]
You could do this with other functions, but I'm just quickly using the data frame matching logic to pull the data.
Step 2: Plot the data as layers. We want to plot all of the "a" data first and then plot all the "point" data.
ggplot() +
geom_point(
data=df_layer_1,
aes(x=x, y=y),
colour="orange",
size=df_layer_1$size) +
geom_point(
data=df_layer_2,
aes(x=x, y=y),
colour="blue",
size=df_layer_2$size)
Notice that the base plot layer ggplot() has no data assigned. This is important, because we are going to override the data for each layer. Then, we have two separate point geometry layers geom_point(...) that use their own specifications. The x and y axis will be shared, but we will use different data, colors, and sizes.
It is important to move the colour and size specifications outside of the aes(...) function, so we can specify these values literally. Otherwise, the 'ggplot' function will usually assign colors and sizes according to the levels found in the data. For instance, if you have size values of 2 and 5 in the data, it will assign a default size to any occurrences of the value 2 and will assign some larger size to any occurrences of the value 5. An 'aes' function specification will not use the values 2 and 5 for the sizes. The same goes for colors. I have exact sizes and colors that I want to use, so I move those arguments into the 'geom_point' function itself. Also, any specifications in the 'aes' function will be put into the legend, which is often not useful.
Final note: In this example, you could achieve the wanted result in many ways, but it is important to understand how 'ggplot2' layers work in order to get the most out of your 'ggplot' charts. As long as you separate your data into different layers before you call the 'ggplot' functions, you have a lot of control over how things will be graphed on the screen.
It's plotted in order of the rows in the data.frame. Try this:
df2 <- rbind(df[-50,],df[50,])
ggplot(df2) + geom_point(aes(x=x, y=y, color=label, size=size))
As you see the green point is drawn last, since it represents the last row of the data.frame.
Here is a way to order the data.frame to have the green point drawn first:
df2 <- df[order(-as.numeric(factor(df$label))),]
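The row-order idea can be wrapped in a small helper. This is a sketch only: draw_last() is a hypothetical function written for this example (it is not part of ggplot2), and it relies on geom_point() drawing rows in data order as described above.

```r
library(ggplot2)

set.seed(1)
df <- data.frame(x = rnorm(500))
df$y <- rnorm(500) * 0.1 + df$x
df$label <- "a"
df$label[50] <- "point"
df$size <- 2

# Hypothetical helper: move the rows flagged by `on_top` to the end of the
# data frame, so they are drawn last and therefore appear on top.
draw_last <- function(data, on_top) {
  rbind(data[!on_top, , drop = FALSE], data[on_top, , drop = FALSE])
}

df_top    <- draw_last(df, df$label == "point")  # "point" drawn on top
df_bottom <- draw_last(df, df$label == "a")      # "point" drawn first, buried

p <- ggplot(df_top, aes(x = x, y = y, color = label, size = size)) +
  geom_point()
p
```

Since this is plain data-frame manipulation done before the ggplot call, it should carry over to rpy2 without needing any extra aesthetics.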

Highlight a particular plot among multiple plots

I've made a plot using a data frame and ggplot. Here's the plot for example
I'll be using this in a presentation. In one slide, I'm going to talk about epsilon=0.1, and in the next I'll be talking about epsilon=0.5. My question is: How do I make one particular plot thicker? i.e. I wish to create a plot where the orange graph corresponding to epsilon=0.1 is thick (and thus highlighted), so the audience knows that is the graph I'm referring to.
What I would do is add an additional column to the data, thickness, which you can assign to the size aesthetic of geom_line. You simply assign a higher value to the values in thickness where epsilon equals 0.1:
df$thickness = ifelse(df$epsilon == 0.1, 2, 1)
and use it in aes() of geom_line():
ggplot(df,aes(x,y,color=as.factor(epsilon))) +
geom_line(aes(size = thickness)) + scale_size_identity()
You can simply change the value in the call to ifelse to change which line gets highlighted. Note the use of scale_size_identity, which prevents ggplot from rescaling the values so that the values in thickness are used as they are.
An example with the built-in dataset mtcars:
ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
geom_line(aes(size = ifelse(cyl == 6, 2, 1))) + # thicker line for cyl == 6
scale_size_identity()
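In ggplot2 3.4.0 and later, line thickness is controlled by the linewidth aesthetic, and using size for lines triggers a deprecation warning. A sketch of the same approach with linewidth follows; the data frame here is hypothetical (the original data was not posted), and the curve y = exp(-epsilon * x) is only a stand-in.

```r
library(ggplot2)

# Hypothetical stand-in data: one curve per epsilon value
df <- expand.grid(x = seq(0, 10, by = 0.5), epsilon = c(0.1, 0.5, 1))
df$y <- exp(-df$epsilon * df$x)

highlight <- 0.1  # change this to pick which curve is emphasised
df$thickness <- ifelse(df$epsilon == highlight, 2, 0.5)

# Assumes ggplot2 >= 3.4.0, where `linewidth` and
# scale_linewidth_identity() are available
p <- ggplot(df, aes(x, y, colour = factor(epsilon))) +
  geom_line(aes(linewidth = thickness)) +
  scale_linewidth_identity()
p
```

For the next slide, setting highlight to 0.5 and rebuilding the plot switches the emphasis without touching the rest of the code.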

ggplot boxplot + fill + color brewer spectrum

I can't seem to be able to fill a boxplot by a continuous value using color brewer, and I know it must just be a simple swap of syntax somewhere, since I can get the outlines of the boxes to adjust based on continuous values. Here's the data I'm working with:
data <- data.frame(
value = sample(1:50),
animals = sample(c("cat","dog","zebra"), 50, replace = TRUE),
region = sample(c("forest","desert","tundra"), 50, replace = TRUE)
)
I want to make a paneled boxplot, ordered by median "value", with the depth of color fill for each box increasing with "value" (I know this is redundant, but bear with me for the sake of the example)
(Ordering the data):
orderindex <- order(as.numeric(by(data$value, data$animals, median)))
data$animals <- ordered(data$animals, levels=levels(factor(data$animals))[orderindex]) # factor() needed: in R >= 4.0 data.frame() keeps strings as character
If I create the boxplot with panels, I can adjust the color of the outlines:
library(ggplot2)
first <- qplot(animals, value, data = data, colour=animals)
second <- first + geom_boxplot() + facet_grid(~region)
third <- second + scale_colour_brewer()
print(third)
But I want to do what I did to the outlines, but instead with the fill of each box (so each box gets darker as "value" increases). I thought that it might be a matter of putting the "scale_colour_brewer()" argument within the aesthetic argument for geom_boxplot, ie
second <- first + geom_boxplot(aes(scale_colour_brewer())) + facet_grid(~region)
but that doesn't seem to do the trick. I know it's a matter of positioning for this "scale_colour_brewer" argument; I just don't know where it goes!
(there is a similar question here but it's not quite what I'm looking for, since the colors of the box don't increase along a spectrum/gradient with some continuous value; it looks like these values are basically factors: Add color to boxplot - "Continuous value supplied to discrete scale" error, and the example at the ggplot site with the cars package:
http://docs.ggplot2.org/0.9.3.1/geom_boxplot.html doesn't seem to work when I set "fill" to "value" ... I get the error:
Error in unit(tic_pos.c, "mm") : 'x' and 'units' must have length > 0)
)
If you need to set fill for the boxplots then instead of color=animals use fill=animals and the same way replace scale_color_brewer() with scale_fill_brewer().
qplot(animals, value, data = data, fill=animals)+
geom_boxplot() + facet_grid(~region) + scale_fill_brewer()
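Since qplot() is deprecated in recent ggplot2 releases, the same plot can be written with ggplot(). The sketch below also uses reorder() to order the boxes by median, which is a substitute for the manual ordering step in the question rather than the asker's own method; the seed is an arbitrary addition for reproducibility.

```r
library(ggplot2)

set.seed(1)  # arbitrary seed, not in the original question
data <- data.frame(
  value   = sample(1:50),
  animals = sample(c("cat", "dog", "zebra"), 50, replace = TRUE),
  region  = sample(c("forest", "desert", "tundra"), 50, replace = TRUE)
)

# Order the animal levels by their median value (lowest first)
data$animals <- reorder(data$animals, data$value, FUN = median)

# fill = animals plus scale_fill_brewer() shades the boxes, with darker
# fills for later levels in the default sequential palette
p <- ggplot(data, aes(x = animals, y = value, fill = animals)) +
  geom_boxplot() +
  facet_grid(~region) +
  scale_fill_brewer()
p
```

Note that the fill here follows the discrete animals factor, ordered by overall median, not a truly continuous value per box.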
