ggplot2 plots more points than asked - r

I am trying to fill a square region with non-overlapping squares with different colors and ggplot2 is plotting more points than those in the dataframe at the higher x and y limits. Here is the code
library(ggplot2)
l = 1000
a=seq(0,1, 1/(l-1))
x=rep(a, each=length(a))
y=rep(a, length(a))
k = length(x)
c=sample(1:10, k, replace = TRUE)
data <- data.frame(x, y, c)
ggplot(data, aes(x=x, y=y)) + geom_point(shape=15, color=c)
ggsave('k.jpg', width=10, height=10)
The result I am getting with RStudio is this. Notice the extra points on the right and top of the image.
How can I get ggplot to plot exactly one square exclusively for those points in the dataframe and not more?
As a second related question, this is what happens if l is changed from 1000 to l=100
My problem is now that the squares are not perfectly stacked, leaving empty space between them. I would like to know how can I compute from the number of points in each dimension of the array (l), the correct value for size inside geom_point so that the squares are perfectly stacked.
Many thanks

You might be better off with geom_tile, rather than geom_point, as this will allow more control over the size of the rectangles and the border width. See ?geom_tile for details.

Providing a couple of alternatives using OP's example, reducing the data frame dimension to increase the size of the tile:
Data
library(ggplot2)
l = 100
a = seq(0, 1, 1 / (l - 1))
x = rep(a, each = length(a))
y = rep(a, length(a))
k = length(x)
c = sample(1:10, k, replace = TRUE)
data <- data.frame(x, y, c)
Example 1
Very simple, just passing "white" as colour to make the tiles more distinctive.
ggplot(data, aes(x = x, y = y, fill = c)) + geom_tile(colour = "white")
Example 2
Creating a palette manually, and using coord_equal to force a fixed aspect ratio (default 1) so that the tiles are squares:
colors<-c("peachpuff", "yellow", "orange", "orangered", "red",
"darkred","firebrick", "royalblue", "darkslategrey", "black")
ggplot(data, aes(x = x, y = y)) +
geom_tile(aes(fill = factor(c)), colour = "white") +
scale_fill_manual(values = colors, name = "Colours") +
coord_equal()
Comparing geom_point and geom_tile
Creating a small data frame (10 x 10, l = 10) to look more closely at what happens when using geom_point instead of geom_tile.
Original OP code
ggplot(data, aes(x = x, y = y)) + geom_point(shape = 15, color = c)
Example 1
ggplot(data, aes(x = x, y = y, fill = c)) + geom_tile(colour = "white")
Example 2
colors<-c("peachpuff", "yellow", "orange", "orangered", "red",
"darkred","firebrick", "royalblue", "darkslategrey", "black")
ggplot(data, aes(x = x, y = y)) +
geom_tile(aes(fill = factor(c)), colour = "white") +
scale_fill_manual(values = colors, name = "Colours") +
coord_equal()
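On the second part of the question (how to derive the square size from l): with geom_tile the size is given in data units rather than in millimetres, so it can be computed from the grid spacing. The following is a minimal sketch, assuming the same evenly spaced grid on [0, 1] as above; the names step and grid are introduced here for illustration and do not come from the original post. The plot is only built if ggplot2 is installed.

```r
# Grid with l points per side on [0, 1], as in the question
l <- 100
a <- seq(0, 1, length.out = l)
step <- a[2] - a[1]               # spacing between neighbours: 1 / (l - 1)

grid <- expand.grid(x = a, y = a)
set.seed(42)                      # arbitrary seed, only for reproducibility
grid$c <- sample(1:10, nrow(grid), replace = TRUE)

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  # Tiles exactly one grid step wide and high touch without gaps or overlap
  p <- ggplot(grid, aes(x = x, y = y, fill = factor(c))) +
    geom_tile(width = step, height = step) +
    coord_equal(expand = FALSE)
}
```

As far as I know, geom_tile already infers this width from a regular grid by default, so setting width and height explicitly mainly documents the intent.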

Related

ggplot2 legend: combine discrete colors and continuous point size

There are similar posts to this, namely here and here, but they address instances where both point color and size are continuous. Is it possible to:
Combine discrete colors and continuous point size within a single legend?
Within that same legend, add a description to each point in place of the numerical break label?
Toy data
xval = as.numeric(c("2.2", "3.7","1.3"))
yval = as.numeric(c("0.3", "0.3", "0.2"))
color.group = c("blue", "red", "blue")
point.size = as.numeric(c("200", "11", "100"))
description = c("descript1", "descript2", "descript3")
df = data.frame(xval, yval, color.group, point.size, description)
ggplot(df, aes(x=xval, y=yval, size=point.size)) +
geom_point(color = df$color.group) +
scale_size_continuous(limits=c(0, 200), breaks=seq(0, 200, by=50))
Doing what you originally asked - continuous + discrete in a single legend - in general doesn't seem to be possible even conceptually. The only sensible thing would be to have two legends for size, with a different color for each legend.
Now let's consider having a single legend. Given your "In my case, each unique combination of point size + color is associated with a description.", it sounds like there are very few possible point sizes. In that case, you could use both scales as discrete. But I believe even that is not enough as you use different variables for size and color scales. A solution then would be to create a single factor variable with all possible combinations of color.group and point.size. In particular,
df <- data.frame(xval, yval, f = interaction(color.group, point.size), description)
ggplot(df, aes(x = xval, y = yval, size = f, color = f)) +
geom_point() + scale_color_discrete(labels = 1:3) +
scale_size_discrete(labels = 1:3)
Here 1:3 are those descriptions that you want, and you may also set the colors the way you like. For instance,
ggplot(df, aes(x = xval, y = yval, size = f, color = f)) +
geom_point() + scale_size_discrete(labels = 1:3) +
scale_color_manual(labels = 1:3, values = c("red", "blue", "green"))
However, we may also exploit color.group by using
ggplot(df, aes(x = xval, y = yval, size = f, color = f)) +
geom_point() + scale_size_discrete(labels = 1:3) +
scale_color_manual(labels = 1:3, values = gsub("(.*)\\..*", "\\1", sort(df$f)))
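To put the actual descriptions into the legend instead of 1:3, the labels and colours can be derived from the data in the order of the factor levels. This is only a sketch, assuming each colour/size combination occurs exactly once (so there is one description per level); legend_labels and legend_colors are names made up for this example.

```r
xval <- c(2.2, 3.7, 1.3)
yval <- c(0.3, 0.3, 0.2)
color.group <- c("blue", "red", "blue")
point.size <- c(200, 11, 100)
description <- c("descript1", "descript2", "descript3")

# One factor level per colour/size combination; unused combinations are dropped
f <- droplevels(interaction(color.group, point.size))
df <- data.frame(xval, yval, f, description)

ord <- order(df$f)                                   # rows in factor-level order
legend_labels <- as.character(df$description[ord])   # description for each level
legend_colors <- sub("\\..*$", "", as.character(df$f[ord]))  # text before the dot

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(df, aes(x = xval, y = yval, size = f, color = f)) +
    geom_point() +
    scale_size_discrete(labels = legend_labels) +
    scale_color_manual(labels = legend_labels, values = legend_colors)
}
```

Because both scales use the same variable and the same labels, ggplot2 should merge them into a single legend. The level order here is red.11, blue.100, blue.200, so the legend runs from the smallest to the largest point.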

ggplot2 colorbar with discontinuous jump for skewed data

Here is some fake data, x and y, with color information z. z is highly skewed, and as such renders the colorbar uninformative:
set.seed(1)
N <- 100
x <- rnorm(N)
y <- x + rnorm(N)
z <- x+y+rnorm(N)
z[z>2] <- z[z>2]+exp(z[z>2]-2)
d <- data.frame(x,y,z)
ggplot(d, aes(x=x, y=y, color = z)) + geom_point()
I'd like to have most of the colorbar reflect the main range of the data, but have a box for overflows, say above 5. Something like this:
Is there a way to do this in ggplot2? Note that I would like the colorbar to remain continuous, rather than discrete, for most of its range. I'll probably either discretize or topcode if what I want isn't feasible.
You can get that general plot, although the legends would need more work:
p <- ggplot(d, aes(x=x, y=y, color = z)) + geom_point(size = 5)
p + scale_color_gradient2(
low = 'green', high = 'red', mid = 'grey80', na.value = 'blue', limits= c(-10, 10)
)
You can cheat in some extra legend fluff, e.g.:
ggplot(d, aes(x=x, y=y, color = z, alpha = '>10')) +
geom_point(size = 5) +
scale_color_gradient2(
low = 'green', high = 'red', mid = 'grey80', na.value = 'blue', limits= c(-10, 10),
guide = guide_colorbar(title.position = 'left')
) +
scale_alpha_manual(
values = 1, name = 'z',
guide = guide_legend(
override.aes = list(color = 'blue'), title.position = 'left',
title.theme = element_text(color = 'white', angle = 0)
)
) +
theme(legend.margin = margin(-5, 10, -5, 10))
Note that red/green palettes are a poor choice for readers with impaired color vision.
Extending upon Axeman's answer I came up with the following slight hack to get blues into your color scale:
First, define a color map with 20 colors for the values within and 5 for the values outside your range.
cmap <- colorRampPalette(c("green","grey80","red"))(20)
cmap <- append(cmap,rep("blue",5))
Then cut the z values into 20 chunks between -10 and 10 and convert to numeric (resulting in NA's for values above 10). By specifying the cmap in scale_color_gradientn and limits of [1,25] we map values of -10 to 1 (green) and 10 to 20 (red). Finally by specifying breaks we manually add the correct labels (i.e. the 5th category corresponds to values between -6 and -5).
ggplot(d, aes(x=x, y=y, color=as.numeric(cut(z, breaks=seq(-10,10))))) +
geom_point(size=3) +
scale_color_gradientn(colors=cmap, limits=c(1,25), breaks=c(5,11,17,23),
labels=c(-6,0,6,">10"), name="z", na.value = "blue")
Lovely result :)
The only issue is that you have to make sure no values ever fall below -10, as they would also be shown in blue with this method.
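The bin mapping can be checked on a few hand-picked values before plotting. This small sketch uses made-up values (z_demo is not from the original data) and shows that values on both sides of the range end up as NA, which is why low values would also be drawn in blue. Note that cut() uses left-open intervals by default, so a value of exactly -10 becomes NA as well.

```r
# Hand-picked values: two outside [-10, 10], four inside
z_demo <- c(-12, -9.5, -5.5, 0.5, 10, 12)

# Same binning as in the plot: 20 unit-wide intervals (-10,-9], ..., (9,10]
bins <- as.numeric(cut(z_demo, breaks = seq(-10, 10)))
print(bins)

# Guard against the low-side problem: report values that would turn blue by mistake
below <- z_demo <= -10
if (any(below)) {
  message(sum(below), " value(s) at or below -10 would get the overflow colour")
}
```

Running a check like this on the real z before plotting makes the limitation explicit rather than silent.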

Create a colour blind test with ggplot

I would like to create a colour blind test, similar to that below, using ggplot.
The basic idea is to use geom_hex (or perhaps a voronoi diagram, or possibly even circles as in the figure above) as the starting point, and define a dataframe that, when plotted in ggplot, produces the image.
We would start by creating a dataset, such as:
df <- data.frame(x = rnorm(10000), y = rnorm(10000))
then plot this:
ggplot(df, aes(x, y)) +
geom_hex() +
coord_equal() +
scale_fill_gradient(low = "red", high = "green", guide = FALSE) +
theme_void()
which gives the image below:
The main missing step is to create a dataset that actually plots a meaningful symbol (letter or number), and I'm not sure how best to go about this without painstakingly mapping the coordinates. Ideally one would be able to read in the coordinates perhaps from an image file.
Finally, a bit of tidying up could round the plot edges by removing the outlying points.
All suggestions are very welcome!
EDIT
Getting a little closer to what I'm after, we can use the image below of the letter 'e':
Using the imager package, we can read this in and convert it to a dataframe:
img <- imager::load.image("e.png")
df <- as.data.frame(img)
then plot that dataframe using geom_raster:
ggplot(df, aes(x, y)) +
geom_raster(aes(fill = value)) +
coord_equal() +
scale_y_continuous(trans = scales::reverse_trans()) +
scale_fill_gradient(low = "red", high = "green", guide = FALSE) +
theme_void()
If we use geom_hex instead of geom_raster, we can get the following plot:
ggplot(df %>% filter(value %in% 1), aes(x, y)) +
geom_hex() +
coord_equal() +
scale_y_continuous(trans = scales::reverse_trans()) +
scale_fill_gradient(low = "red", high = "green", guide = FALSE) +
theme_void()
so, getting there but clearly still a long way off...
Here's an approach for creating this plot:
Packages you need:
library(tidyverse)
library(packcircles)
Get image into a 2D matrix (x and y coordinates) of values. To do this, I downloaded the .png file of the e as "e.png" and saved in my working directory. Then some processing:
img <- png::readPNG("e.png")
# From http://stackoverflow.com/questions/16496210/rotate-a-matrix-in-r
rotate <- function(x) t(apply(x, 2, rev))
# Convert to one colour layer and rotate it to be in right direction
img <- rotate(img[,,1])
# Check that matrix makes sense:
image(img)
Next, create a whole lot of circles! I did this based on this post.
# Create random "circles"
# *** THESE VALUES MAY NEED ADJUSTING
ncircles <- 1200
offset <- 100
rmax <- 80
x_limits <- c(-offset, ncol(img) + offset)
y_limits <- c(-offset, nrow(img) + offset)
xyr <- data.frame(
x = runif(ncircles, min(x_limits), max(x_limits)),
y = runif(ncircles, min(y_limits), max(y_limits)),
r = rbeta(ncircles, 1, 10) * rmax)
# Find non-overlapping arrangement
res <- circleLayout(xyr, x_limits, y_limits, maxiter = 1000)
cat(res$niter, "iterations performed")
#> 1000 iterations performed
# Convert to data for plotting (just circles for now)
plot_d <- circlePlotData(res$layout)
# Check circle arrangement
ggplot(plot_d) +
geom_polygon(aes(x, y, group=id), colour = "white", fill = "skyblue") +
coord_fixed() +
theme_minimal()
Finally, interpolate the image pixel values for the centre of each circle. This will indicate whether a circle is centered over the shape or not. Add some noise to get variance in colour and plot.
# Get x,y positions of centre of each circle
circle_positions <- plot_d %>%
group_by(id) %>%
summarise(x = min(x) + (diff(range(x)) / 2),
y = min(y) + (diff(range(y)) / 2))
# Interpolate on original image to get z value for each circle
circle_positions <- circle_positions %>%
mutate(
z = fields::interp.surface(
list(x = seq(nrow(img)), y = seq(ncol(img)), z = img),
as.matrix(.[, c("x", "y")])),
z = ifelse(is.na(z), 1, round(z)) # 1 is the "empty" area shown earlier
)
# Add a little noise to the z values
set.seed(070516)
circle_positions <- circle_positions %>%
mutate(z = z + rnorm(n(), sd = .1))
# Bind z value to data for plotting and use as fill
plot_d %>%
left_join(select(circle_positions, id, z)) %>%
ggplot(aes(x, y, group = id, fill = z)) +
geom_polygon(colour = "white", show.legend = FALSE) +
scale_fill_gradient(low = "#008000", high = "#ff4040") +
coord_fixed() +
theme_void()
#> Joining, by = "id"
To get colours right, tweak them in scale_fill_gradient

Automated way to prevent ggplot hexbin from cutting geoms off axes

This is a slightly different question from an earlier post (ggplot hexbin shows different number of hexagons in plot versus data frame).
I am using hexbin() to bin data into hexagon objects, and ggplot() to plot the results. I notice that, sometimes, the hexagons on the edge of the plot are cut in half. Below is an example.
library(hexbin)
library(ggplot2)
set.seed(1)
data <- data.frame(A=rnorm(100), B=rnorm(100), C=rnorm(100), D=rnorm(100), E=rnorm(100))
maxVal = max(abs(data))
maxRange = c(-1*maxVal, maxVal)
x = data[,c("A")]
y = data[,c("E")]
h <- hexbin(x=x, y=y, xbins=5, shape=1, IDs=TRUE, xbnds=maxRange, ybnds=maxRange)
hexdf <- data.frame(hcell2xy(h), hexID = h@cell, counts = h@count)
ggplot(hexdf, aes(x = x, y = y, fill = counts, hexID = hexID)) +
geom_hex(stat = "identity") +
coord_cartesian(xlim = c(maxRange[1], maxRange[2]), ylim = c(maxRange[1], maxRange[2]))
This creates a graphic where one hexagon is cut off at the top and one hexagon is cut off at the bottom:
Another approach I can try is to hard-code a value (here 1.5) to be added to the limits of the x and y axis. Doing so does seem to solve the problem in that no hexagons are cut off anymore.
ggplot(hexdf, aes(x = x, y = y, fill = counts, hexID = hexID)) +
geom_hex(stat = "identity") +
scale_x_continuous(limits = maxRange * 1.5) +
scale_y_continuous(limits = maxRange * 1.5)
However, even though the second approach solves the problem in this instance, the value of 1.5 is arbitrary. I am trying to automate this process for a variety of data and variety of bin sizes and hexagon sizes that could be used. Is there a solution to keeping all hexagons fully visible in the plot without having to hard-code an arbitrary value that may be too large or too small for certain instances?
Consider that you can skip the computation of hexbin, and let ggplot do the job.
Then, if you prefer to manually set the width of the bins you can set the binwidth and modify the limits:
bwd = 1
ggplot(data, aes(x = x, y = y)) +
geom_hex(binwidth = bwd) +
coord_cartesian(xlim = c(min(x) - bwd, max(x) + bwd),
ylim = c(min(y) - bwd, max(y) + bwd),
expand = T) +
geom_point(color = "red") +
theme_bw()
This way, hexagons should never be truncated (though you may end up with some "empty" space).
Result with bwd = 1:
Result with bwd = 3:
If instead you prefer to programmatically set the number of the bins, you can use:
nbins_x <- 4
nbins_y <- 6
range_x <- range(data$A, na.rm = T)
range_y <- range(data$E, na.rm = T)
bwd_x <- (range_x[2] - range_x[1])/nbins_x
bwd_y <- (range_y[2] - range_y[1])/nbins_y
ggplot(data, aes(x = A, y = E)) +
geom_hex(bins = c(nbins_x,nbins_y)) +
coord_cartesian(xlim = c(range_x[1] - bwd_x, range_x[2] + bwd_x),
ylim = c(range_y[1] - bwd_y, range_y[2] + bwd_y),
expand = T) +
geom_point(color = "red")+
theme_bw()
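To automate this across data sets, the padding logic above can be wrapped in a small helper. This is a sketch only: padded_limits is a hypothetical name introduced here, and it simply widens the data range by one bin width on each side, as the code above does by hand.

```r
# Hypothetical helper: widen the range of v by one bin width on each side
padded_limits <- function(v, bins) {
  r <- range(v, na.rm = TRUE)
  bwd <- diff(r) / bins          # bin width implied by the number of bins
  c(r[1] - bwd, r[2] + bwd)
}

# Toy check: range 0..8 split into 4 bins gives a bin width of 2
lims <- padded_limits(c(0, 2, 8), bins = 4)
print(lims)
```

With the data from the question, the limits would then be passed as coord_cartesian(xlim = padded_limits(data$A, nbins_x), ylim = padded_limits(data$E, nbins_y)).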

Size of points in ggplot2 comparable across plots?

I am using ggplot2 to produce various plots in which the size of a point is proportional to the number of cases that have the same values of x and y. Is there a way to make the size of the points comparable across different plots that have different values of size?
Example using fake data:
df1 = data.frame(x = seq(1:10),
y = c(4,3.8,3.8,3.2,3.1,2.5,2,1.5,1.2,1.3),
size = c(1,20,1,70,100,70,1,1,110,1))
library(ggplot2)
pdf("plot.1.pdf")
ggplot(df1, aes(x = x, y = y, size = size)) + geom_point()
dev.off()
df2 = data.frame(x = seq(1:10),
y = c(4,3.8,3.8,3.2,3.1,2.5,2,1.5,1.2,1.3),
size = rep(1, 10))
pdf("plot.2.pdf")
ggplot(df2, aes(x = x, y = y, size = size)) + geom_point()
dev.off()
The points in Plot 2, which all have size equal to 1, are much larger than the points in Plot 1 for which size equals 1. I need a version of the plots where points with the same value of size have the same size across different plots. Thank you,
Sofia
One possibility is to use scale_size_identity(), which interprets size directly in units of point size, so points with value 1 will be the same size in both plots. However, this approach produces points that are too large when the size values are big (as in your case). To deal with that, you can use a transformation inside the scale, for example a square root, with the argument trans="sqrt".
ggplot(df1, aes(x = x, y = y, size = size)) +
geom_point()+scale_size_identity(trans="sqrt",guide="legend")
ggplot(df2, aes(x = x, y = y, size = size)) +
geom_point()+scale_size_identity(trans="sqrt",guide="legend")
UPDATE
As pointed out by @hadley, the easiest way to achieve this is to set limits= inside scale_size_continuous() to the same values in both plots, so that equal values get identical sizes.
ggplot(df1, aes(x = x, y = y, size = size)) + geom_point()+
scale_size_continuous(limits=c(1,110))
ggplot(df2, aes(x = x, y = y, size = size)) + geom_point()+
scale_size_continuous(limits=c(1,110))
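Rather than typing the limits by hand, they can be computed from all the data frames involved. A minimal sketch, assuming the two data frames from the question; shared_limits is a name introduced here, and the plots are only built if ggplot2 is installed.

```r
y_vals <- c(4, 3.8, 3.8, 3.2, 3.1, 2.5, 2, 1.5, 1.2, 1.3)
df1 <- data.frame(x = 1:10, y = y_vals,
                  size = c(1, 20, 1, 70, 100, 70, 1, 1, 110, 1))
df2 <- data.frame(x = 1:10, y = y_vals, size = rep(1, 10))

# Common limits covering the size values of both plots
shared_limits <- range(c(df1$size, df2$size))

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p1 <- ggplot(df1, aes(x = x, y = y, size = size)) + geom_point() +
    scale_size_continuous(limits = shared_limits)
  p2 <- ggplot(df2, aes(x = x, y = y, size = size)) + geom_point() +
    scale_size_continuous(limits = shared_limits)
}
```

With the same limits on both scales, a size value of 1 maps to the same point size in p1 and p2.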
