I've made a plot using a data frame and ggplot. Here's the plot for example
I'll be using this in a presentation. In one slide, I'm going to talk about epsilon=0.1, and in the next I'll be talking about epsilon=0.5. My question is: How do I make one particular plot thicker? i.e. I wish to create a plot where the orange graph corresponding to epsilon=0.1 is thick (and thus highlighted), so the audience knows that is the graph I'm referring to.
What I would do is add an additional column to the data, thickness, which you can assign to the size aesthetic of geom_line. You simply assign a higher value to the values in thickness where epsilon equals 0.1:
df$thickness = ifelse(df$epsilon == 0.1, 2, 1)
and use it in aes() of geom_line():
ggplot(df,aes(x,y,color=as.factor(epsilon))) +
geom_line(aes(size = thickness)) + scale_size_identity()
You can change the value in the call to ifelse to change which line gets highlighted. Note the use of scale_size_identity, which prevents ggplot from rescaling the values, so the values in thickness are used as-is.
An example with the built-in dataset mtcars:
ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
geom_line(aes(size = ifelse(cyl == 6, 2, 1))) +
scale_size_identity()
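Since the original data frame was not posted, here is a minimal self-contained sketch of the same idea. The data, the column names (x, y, epsilon), and the curve shape are invented for illustration; only the thickness column and scale_size_identity() come from the answer above. Note that ggplot2 3.4.0 and later prefer the linewidth aesthetic for lines and warn when size is used, so treat the use of size here as version-dependent.

```r
library(ggplot2)

# Hypothetical data: one curve per epsilon value (not the asker's data).
df <- expand.grid(x = seq(0, 10, by = 0.5), epsilon = c(0.1, 0.5, 0.9))
df$y <- exp(-df$epsilon * df$x)

# Which curve to emphasise on the current slide.
highlight <- 0.1

# Compare with a tolerance rather than ==, since epsilon is floating point.
df$thickness <- ifelse(abs(df$epsilon - highlight) < 1e-8, 2, 1)

p <- ggplot(df, aes(x, y, color = as.factor(epsilon))) +
  geom_line(aes(size = thickness)) +
  scale_size_identity()
print(p)
```

For the next slide, setting highlight to 0.5 and rebuilding the plot should thicken the other curve instead.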
Related to this question.
If I create a gradient using colorRampPalette, is there a way to have ggplot2 automatically detect the number of colours it will need from this gradient?
In the example below, I have to specify 3 colours will be needed for the 3 cyl values. This requires me knowing ahead of time that I'll need this many. I'd like to not have to specify it and have ggplot detect the number it will need automatically.
myColRamp <- colorRampPalette(c('#a0e2f2', '#27bce1'))
ggplot(mtcars, aes(x = wt, y = mpg, col = as.factor(cyl))) +
geom_point(size = 3) +
scale_colour_manual(values = myColRamp(3)) # How to avoid having to specify 3?
I'm also open to options that don't use colorRampPalette but achieve the same functionality.
I see two options here. One which requires a little customisation. One which has more code but requires no customisation.
Option 1 - Determine number of unique factors from your specific variable
Simply use the length and unique functions to work out how many factors are in cyl.
values = myColRamp(length(unique(mtcars$cyl)))
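Put together with the plot from the question, Option 1 might look like the sketch below. It only assembles pieces already shown; the variable names n_cyl and cyl_colours are made up for readability.

```r
library(ggplot2)

myColRamp <- colorRampPalette(c('#a0e2f2', '#27bce1'))

# Count the distinct values of the variable mapped to colour.
n_cyl <- length(unique(mtcars$cyl))
cyl_colours <- myColRamp(n_cyl)

p <- ggplot(mtcars, aes(x = wt, y = mpg, col = as.factor(cyl))) +
  geom_point(size = 3) +
  scale_colour_manual(values = cyl_colours)
print(p)
```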
Option 2 - Build the plot, and see how many colours it used
If you don't want to specify the name of the variable, and want something more general, we can build the plot, and see how many colours ggplot used, then build it again.
To do this, we also have to save our plot as an object, let's call that plot object p.
p <- ggplot(mtcars, aes(x = wt, y = mpg, col = as.factor(cyl))) +
geom_point(size = 3)
#Notice I haven't set the colour option this time
p_built <- ggplot_build(p) #This builds the plot and saves the data based on
#the plot, so x data is called 'x', y is called 'y',
#and importantly in this case, colour is called the
#generic 'colour'.
#Now we can fish out that data and check how many colour levels were used
num_colours <- length(unique(p_built$data[[1]]$colour))
#Now we know how many colours were used, we can add the colour scale to our plot
p <- p + scale_colour_manual(values = myColRamp(num_colours))
Now either just call p or print(p) depending on your use to view it.
I'm trying to color my stacked histogram according to a factor column, but all the bars have a "green" roof. I want the top of each bar to be the same color as the bar itself. The figure below shows clearly what is wrong: every bar has a green horizontal line at the top.
Here is a dummy data set :
BodyLength <- rnorm(100, mean = 50, sd = 3)
vector <- c("80","10","5","5")
colors <- c("black","blue","red","green")
color <- rep(colors,vector)
data <- data.frame(BodyLength,color)
And the program I used to generate the plot below :
plot <- ggplot(data = data, aes(x=data$BodyLength, color = factor(data$color), fill=I("transparent")))
plot <- plot + geom_histogram()
plot <- plot + scale_colour_manual(values = c("Black","blue","red","green"))
Also, since the data column itself contains color names, any way I don't have to specify them again in scale_color_manual? Can ggplot identify them from the data itself? But I would really like help with the first problem right now...Thanks.
Here is a quick way to get your colors to scale_colour_manual without writing out a vector:
data <- data.frame(BodyLength,color)
data$color<- factor(data$color)
and then later,
scale_colour_manual(values = levels(data$color))
Now, with respect to your first problem, I don't know exactly why your bars have green roofs. However, you may want to look at some different options for the position argument in geom_histogram, such as
plot + geom_histogram(position="identity")
or position="dodge". The identity option is closer to what you want, but since green is the last line drawn, it overwrites the previous colors.
I like density plots better for these problems myself.
ggplot(data=data, aes(x=BodyLength, color=color)) + geom_density()
ggplot(data=data, aes(x=BodyLength, fill=color)) + geom_density(alpha=.3)
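On the side question of reusing the colour names already stored in the data: one option that is not part of the answer above is scale_colour_identity(), which tells ggplot2 to treat the mapped values as literal colours. A minimal sketch, assuming the column holds valid R colour names (the seed and the numeric counts passed to rep() are choices made for this example):

```r
library(ggplot2)

set.seed(42)  # arbitrary seed, only for reproducibility
colors <- c("black", "blue", "red", "green")
data <- data.frame(
  BodyLength = rnorm(100, mean = 50, sd = 3),
  color = rep(colors, times = c(80, 10, 5, 5)),
  stringsAsFactors = FALSE
)

# The color column is used as-is; no scale_colour_manual() is needed.
p <- ggplot(data, aes(x = BodyLength, colour = color)) +
  geom_histogram(bins = 30, fill = NA, position = "identity") +
  scale_colour_identity()
print(p)
```

Identity scales draw no legend by default, which may or may not be what you want here.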
I'm plotting a dense scatter plot in ggplot2 where each point might be labeled by a different color:
df <- data.frame(x=rnorm(500))
df$y = rnorm(500)*0.1 + df$x
df$label <- c("a")
df$label[50] <- "point"
df$size <- 2
ggplot(df) + geom_point(aes(x=x, y=y, color=label, size=size))
When I do this, the scatter point labeled "point" (green) is plotted on top of the red points which have the label "a". What controls this z ordering in ggplot, i.e. what controls which point is on top of which?
For example, what if I wanted all the "a" points to be on top of all the points labeled "point" (meaning they would sometimes partially or fully hide that point)? Does this depend on alphanumerical ordering of labels?
I'd like to find a solution that can be translated easily to rpy2.
2016 Update:
The order aesthetic has been deprecated, so at this point the easiest approach is to sort the data.frame so that the green point is at the bottom, and is plotted last. If you don't want to alter the original data.frame, you can sort it during the ggplot call - here's an example that uses %>% and arrange from the dplyr package to do the on-the-fly sorting:
library(dplyr)
ggplot(df %>%
arrange(label),
aes(x = x, y = y, color = label, size = size)) +
geom_point()
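If dplyr is not available (for instance when driving R through rpy2), the same reordering can be sketched in base R. The helper name draw_last is made up for this example; it moves the rows with a chosen label to the end of the data frame so that they are drawn last, that is, on top.

```r
library(ggplot2)

set.seed(1)
df <- data.frame(x = rnorm(500))
df$y <- rnorm(500) * 0.1 + df$x
df$label <- "a"
df$label[50] <- "point"
df$size <- 2

# Hypothetical helper: rows whose label equals `top` are moved to the end.
# order() on a logical vector puts FALSE before TRUE.
draw_last <- function(data, top) {
  data[order(data$label == top), ]
}

df_sorted <- draw_last(df, "point")

p <- ggplot(df_sorted, aes(x = x, y = y, color = label, size = size)) +
  geom_point()
print(p)
```

Calling draw_last(df, "a") instead should bury the "point" row underneath the "a" points.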
Original 2015 answer for ggplot2 versions < 2.0.0
In ggplot2, you can use the order aesthetic to specify the order in which points are plotted. The last ones plotted will appear on top. To apply this, you can create a variable holding the order in which you'd like points to be drawn.
To put the green dot on top by plotting it after the others:
df$order <- ifelse(df$label=="a", 1, 2)
ggplot(df) + geom_point(aes(x=x, y=y, color=label, size=size, order=order))
Or to plot the green dot first and bury it, plot the points in the opposite order:
ggplot(df) + geom_point(aes(x=x, y=y, color=label, size=size, order=-order))
For this simple example, you can skip creating a new sorting variable and just coerce the label variable to a factor and then a numeric:
ggplot(df) +
geom_point(aes(x=x, y=y, color=label, size=size, order=as.numeric(factor(df$label))))
ggplot2 will create plots layer-by-layer and within each layer, the plotting order is defined by the geom type. The default is to plot in the order that they appear in the data.
Where this is different, it is noted. For example
geom_line
Connect observations, ordered by x value.
and
geom_path
Connect observations in data order
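To see the difference between the two geoms, here is a small sketch with made-up data whose rows are deliberately not sorted by x:

```r
library(ggplot2)

# Made-up data, deliberately not sorted by x.
zigzag <- data.frame(x = c(1, 4, 2, 5, 3), y = c(1, 2, 3, 4, 5))

# geom_line() connects the points from left to right (sorted by x).
p_line <- ggplot(zigzag, aes(x, y)) + geom_line() + ggtitle("geom_line")

# geom_path() connects the points in row order: x = 1, 4, 2, 5, 3.
p_path <- ggplot(zigzag, aes(x, y)) + geom_path() + ggtitle("geom_path")

print(p_line)
print(p_path)
```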
There are also known issues regarding the ordering of factors, and it is interesting to note the response of the package author Hadley
The display of a plot should be invariant to the order of the data frame - anything else is a bug.
With this quote in mind, note that a layer is still drawn in the specified order, so overplotting can be an issue, especially when creating dense scatter plots. So if you want a consistent plot (and not one that relies on the order of rows in the data frame), you need to think a bit more.
Create a second layer
If you want certain values to appear above other values, you can use the subset argument to create a second layer to definitely be drawn afterwards. You will need to explicitly load the plyr package so .() will work.
set.seed(1234)
df <- data.frame(x=rnorm(500))
df$y = rnorm(500)*0.1 + df$x
df$label <- c("a")
df$label[50] <- "point"
df$size <- 2
library(plyr)
ggplot(df) + geom_point(aes(x = x, y = y, color = label, size = size)) +
geom_point(aes(x = x, y = y, color = label, size = size),
subset = .(label == 'point'))
Update
In ggplot2_2.0.0, the subset argument is deprecated. Use e.g. base::subset to select relevant data specified in the data argument. And no need to load plyr:
ggplot(df) +
geom_point(aes(x = x, y = y, color = label, size = size)) +
geom_point(data = subset(df, label == 'point'),
aes(x = x, y = y, color = label, size = size))
Or use alpha
Another approach to avoid the problem of overplotting would be to set the alpha (transparency) of the points. This will not be as effective as the explicit second layer approach above; however, with judicious use of scale_alpha_manual you should be able to get something to work.
For example:
# set alpha = 1 (no transparency) for your point(s) of interest
# and a low value otherwise
ggplot(df) + geom_point(aes(x=x, y=y, color=label, size=size,alpha = label)) +
scale_alpha_manual(guide='none', values = c(a = 0.2, point = 1))
The fundamental question here can be rephrased like this:
How do I control the layers of my plot?
In the 'ggplot2' package, you can do this quickly by splitting each different layer into a different command. Thinking in terms of layers takes a little bit of practice, but it essentially comes down to what you want plotted on top of other things. You build from the background upwards.
Prep: Prepare the sample data. This step is only necessary for this example, because we don't have real data to work with.
# Establish random seed to make data reproducible.
set.seed(1)
# Generate sample data.
df <- data.frame(x=rnorm(500))
df$y = rnorm(500)*0.1 + df$x
# Initialize 'label' and 'size' default values.
df$label <- "a"
df$size <- 2
# Label and size our "special" point.
df$label[50] <- "point"
df$size[50] <- 4
You may notice that I've added a different size to the example just to make the layer difference clearer.
Step 1: Separate your data into layers. Always do this BEFORE you use the 'ggplot' function. Too many people get stuck by trying to do data manipulation from within the 'ggplot' functions. Here, we want to create two layers: one with the "a" labels and one with the "point" labels.
df_layer_1 <- df[df$label=="a",]
df_layer_2 <- df[df$label=="point",]
You could do this with other functions, but I'm just quickly using the data frame matching logic to pull the data.
Step 2: Plot the data as layers. We want to plot all of the "a" data first and then plot all the "point" data.
ggplot() +
geom_point(
data=df_layer_1,
aes(x=x, y=y),
colour="orange",
size=df_layer_1$size) +
geom_point(
data=df_layer_2,
aes(x=x, y=y),
colour="blue",
size=df_layer_2$size)
Notice that the base plot layer ggplot() has no data assigned. This is important, because we are going to override the data for each layer. Then, we have two separate point geometry layers geom_point(...) that use their own specifications. The x and y axis will be shared, but we will use different data, colors, and sizes.
It is important to move the colour and size specifications outside of the aes(...) function, so we can specify these values literally. Otherwise, the 'ggplot' function will usually assign colors and sizes according to the levels found in the data. For instance, if you have size values of 2 and 5 in the data, it will assign a default size to any occurrences of the value 2 and will assign some larger size to any occurrences of the value 5. An 'aes' function specification will not use the values 2 and 5 for the sizes. The same goes for colors. I have exact sizes and colors that I want to use, so I move those arguments into the 'geom_point' function itself. Also, any specifications in the 'aes' function will be put into the legend, which can be really useless.
Final note: In this example, you could achieve the wanted result in many ways, but it is important to understand how 'ggplot2' layers work in order to get the most out of your 'ggplot' charts. As long as you separate your data into different layers before you call the 'ggplot' functions, you have a lot of control over how things will be graphed on the screen.
It's plotted in order of the rows in the data.frame. Try this:
df2 <- rbind(df[-50,],df[50,])
ggplot(df2) + geom_point(aes(x=x, y=y, color=label, size=size))
As you see the green point is drawn last, since it represents the last row of the data.frame.
Here is a way to order the data.frame to have the green point drawn first:
df2 <- df[order(-as.numeric(factor(df$label))),]
I am using ggplot2 to create several plots about the same data. In particular I am interested in plotting observations according to a factor variable with 6 levels ("cluster").
But the plots produced by ggplot2 use different palettes every time!
For example, if I make a bar plot with this formula I get this result (this palette is what I expect to obtain):
qplot(cluster, data = data, fill = cluster) + ggtitle("Clusters")
And if I make a scatter plot and I try to color the observations according to their belonging to a cluster I get this result (notice that the color palette is different):
ggplot(data, aes(liens_ratio,RT_ratio)) +
geom_point(col=data$cluster, size=data$nombre_de_tweet/100+2) +
geom_smooth() +
ggtitle("Links - RTs")
Any idea on how to solve this issue?
I can't be certain this will work in your specific case without a reproducible example, but I'm reasonably confident that all you need to do is set your color inside an aes() call within the geom you want to color. That is,
ggplot(data, aes(x = liens_ratio, y = RT_ratio)) +
geom_point(aes(color = cluster, size = nombre_de_tweet/100+2)) +
geom_smooth() +
ggtitle("Links - RTs")
If all the plots you make use the same data and this basic format, the color palette should be the same regardless of the geom used. Additional elements, such as the line from geom_smooth(), will not be changed unless they are also explicitly colored.
The palette will just be the default one, of course; to change it look into scale_color_manual.
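One way to make the palette explicit and identical across plots is to define a named colour vector once and pass it to both the fill and the colour scale. The data below are invented (the asker's data were not posted), and rainbow(6) is only a placeholder palette.

```r
library(ggplot2)

set.seed(123)
# Invented stand-in for the asker's data: a 6-level factor plus two ratios.
data <- data.frame(
  cluster = factor(rep(1:6, length.out = 200)),
  liens_ratio = runif(200),
  RT_ratio = runif(200)
)

# One named colour per factor level, defined once and reused everywhere.
cluster_cols <- setNames(rainbow(6), levels(data$cluster))

p_bar <- ggplot(data, aes(x = cluster, fill = cluster)) +
  geom_bar() +
  scale_fill_manual(values = cluster_cols) +
  ggtitle("Clusters")

p_points <- ggplot(data, aes(x = liens_ratio, y = RT_ratio)) +
  geom_point(aes(color = cluster)) +
  scale_color_manual(values = cluster_cols) +
  ggtitle("Links - RTs")

print(p_bar)
print(p_points)
```

Because the vector is named by factor level, each cluster keeps its colour even if a plot shows only a subset of the levels.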
In theory, when I assign a color attribute to geom_line, there are two data points from which each segment's color could be taken. In practice, ggplot2 seems to be taking the first point's value and carrying it forward as the line's color. Is there a way to use the second point's attribute value to assign the color instead of the first one?
ggplot(data, aes(x = timeVal, y = yVal, group = groupVal, color = colorVal)) + geom_line()
First, we will define some sample data to make a reproducible example
set.seed(15)
dd<-data.frame(x=rep(1:5, 2), y=cumsum(runif(10)),
group=rep(letters[1:2],each=5), other=sample(letters[1:4], 10, replace=TRUE))
Explicit Transformation
I guess if I want to color each segment individually, I'd prefer to be explicit and do the transformation myself. I would do a transformation like
d2 <- do.call(rbind, lapply(split(dd, dd$group), function(x)
data.frame(
X=embed(x$x,2),
Y=embed(x$y, 2),
OTHER=x$other[-1],
GROUP=x$group[-1])
))
And then I can compare plots
ggplot(dd, aes(x,y,group=group, color=other)) +
geom_line() + ggtitle("Default")
ggplot(d2, aes(x=X.1, xend=X.2,y=Y.1,yend=Y.2, color=OTHER)) +
geom_segment() + ggtitle("Transformed")
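The key step in the transformation is embed(), which pairs each value with its predecessor. A short base R illustration (the vector v is arbitrary):

```r
# embed(v, 2) returns a two-column matrix: column 1 holds each value from
# the second element onward, column 2 holds the value just before it.
v <- c(10, 20, 30, 40)
pairs <- embed(v, 2)
print(pairs)

# Each row therefore describes one segment, running between the previous
# point (column 2) and the current point (column 1). This is why
# OTHER = x$other[-1] picks up the attribute of each segment's second point.
```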
Reverse Sort + geom_path
An alternative is to use geom_path rather than geom_line. The latter has an explicit sort along the x axis, and it only uses the start point to get color information. So if you reverse sort the points and then use geom_path to avoid the sort, you have better control of what goes first and therefore more control over which point the color properties come from.
ggplot(dd[order(dd$group, -dd$x), ], aes(x,y,group=group, color=other)) +
geom_path() + ggtitle("Reversed")