ggplot hexbin shows different number of hexagons in plot versus data frame

I am using hexbin() to bin data into hexagon objects, and ggplot() to plot the results. I notice that, sometimes, the binning data frame contains a different number of hexagons than the plot that results from plotting that same binning data frame. Below is an example.
library(hexbin)
library(ggplot2)
set.seed(1)
data <- data.frame(A=rnorm(100), B=rnorm(100), C=rnorm(100), D=rnorm(100), E=rnorm(100))
maxVal = max(abs(data))
maxRange = c(-1*maxVal, maxVal)
x = data[,c("A")]
y = data[,c("E")]
h <- hexbin(x=x, y=y, xbins=5, shape=1, IDs=TRUE, xbnds=maxRange, ybnds=maxRange)
hexdf <- data.frame(hcell2xy(h), hexID = h@cell, counts = h@count)
# Both objects below indicate there are 17 hexagons
# hexdf
# table(h@cID)
# However, plotting only shows 16 hexagons
ggplot(hexdf, aes(x=x, y=y, fill = counts, hexID=hexID)) + geom_hex(stat="identity") + scale_x_continuous(limits = maxRange) + scale_y_continuous(limits = maxRange)
In this example, the hexdf data frame contains 17 hexagons. However, the plot produced by ggplot(hexdf) shows only 16 hexagons, as shown below.
Note: Syntax in the above example may seem cumbersome, but some of it is because this is a MWE for a more complex goal and I am intentionally keeping those components so that any possible solution might extend to my more complex goal. For instance, I want to maintain the capability to allow for the maxRange variable to be computed from the original data frame called data (which contains additional columns "B", "C", and "D"). At the same time, there may be parts of my syntax that are unnecessarily cumbersome and may be causing the problem - so I am happy to try to fix them to see.
Any ideas what might be causing this discrepancy and how to fix it? Thank you!

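The last hexagon is missing because it lies (partly) outside the limits you set: limits passed to scale_x_continuous() and scale_y_continuous() discard anything that falls outside them. The hexagon is included if you widen the limits, e.g. like so: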
ggplot(hexdf, aes(x = x, y = y, fill = counts, hexID = hexID)) +
geom_hex(stat = "identity") +
scale_x_continuous(limits = maxRange * 1.5) +
scale_y_continuous(limits = maxRange * 1.5)
or by using coord_cartesian instead:
ggplot(hexdf, aes(x = x, y = y, fill = counts, hexID = hexID)) +
geom_hex(stat = "identity") +
coord_cartesian(xlim = c(maxRange[1], maxRange[2]), ylim = c(maxRange[1], maxRange[2]))
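The difference between the two fixes can be checked on a much smaller case. The following is a minimal sketch, assuming ggplot2 is installed and using a made-up three-point data frame (toy is not part of the question). It compares the data each plot would draw when the same limits are given to the scale and to coord_cartesian(), relying on the scale's default out-of-bounds handling.

```r
library(ggplot2)

# Hypothetical toy data: the last point lies beyond the upper x limit of 2.
toy <- data.frame(x = c(0, 1, 3), y = c(0, 1, 2))

# Limits on the scale: out-of-range values are replaced by NA before drawing.
p_scale <- ggplot(toy, aes(x = x, y = y)) +
  geom_point() +
  scale_x_continuous(limits = c(-2, 2))

# Limits on the coordinate system: the data are kept, only the view changes.
p_coord <- ggplot(toy, aes(x = x, y = y)) +
  geom_point() +
  coord_cartesian(xlim = c(-2, 2))

# Count the x values that each approach has turned into NA.
n_dropped_scale <- sum(is.na(layer_data(p_scale)$x))
n_dropped_coord <- sum(is.na(layer_data(p_coord)$x))
```

Comparing n_dropped_scale with n_dropped_coord shows which approach loses data; a hexagon whose centre falls beyond the scale limits is presumably lost in the same way.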

Related

Convert and represent the data in different unit system in R

I am drawing various plots in a Shiny app that I have developed. The raw datasets have all their data points in meters; for example, one of them looks like this:
df <- data.frame(X = c(0.000000000000,4.99961330240E-005,9.99922660480E-005,0.000149988399072,0.00019998453209, 0.000249980665120,0.000299976798144,0.000349972931168,0.000399969064192,0.000449965197216,0.000499961330240,0.000549957463264,0.000599953596288,0.000649949729312,0.000699945862336,0.000749941995360,0.000799938128384,0.000849934261408,0.000899930394432,0.000949926527456,0.000999922660480,0.00104991879350,0.00109991492653,0.00114991105955),
Y = c(0.00120303964354,0.00119632557146,0.00119907223731,0.00120059816279,0.00119785149693,0.00119876705222,0.00119327372051,0.00118900112918,0.00118930631428,0.00119174779504,0.00119113742485,0.00119541001617,0.00119815668203,0.00119052705466,0.00119205298013,0.00118930631428,0.00119174779504,0.00119388409070,0.00118778038881,0.00122287667470,0.00122684408094,0.00122623371075,0.00122867519150,0.00122379222999))
My attempt to plot:
g <- ggplot(data = df) + theme_bw() +
geom_point(aes_string(x= df[,1], y= df[,2]), colour= "red", size = 0.1)
ggplotly(g)
And the plot looks like this:
What I want:
The data in the data file are in meters, but on the plot I need the Y-axis values shown in micrometers and the X-axis values shown in millimeters. The data frame illustrated above is just a small part of my actual data frame, which is very large.
Is there any way to do this automatically, without requiring the user to change the units manually?
In the end, I want 'Y' values to be multiplied by 10^6 and 'X' value to be multiplied by 10^3 in order to convert them into micrometers and millimeters respectively.
I got two possible answers to my question:
1st is:
g <- ggplot(data = df) + theme_bw() +
geom_point(aes(x = X * 10^3, y = Y * 10^6), colour = "red", size = 0.1)
ggplotly(g)
2nd is:
M <- data.frame(x = df[,1] * 10^3, y = df[,2] * 10^6)
g <- ggplot(data = M) + theme_bw() +
geom_point(aes(x = x, y = y), colour = "red", size = 0.1)
ggplotly(g)
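If the conversion has to be applied to many data frames, it may help to wrap it in a small function. This is only a sketch in base R: convert_units and the stand-in data frame df_m are hypothetical names introduced here, and the factors are the ones stated in the question (10^3 for X, 10^6 for Y).

```r
# Hypothetical helper: multiply the named columns by fixed conversion factors.
convert_units <- function(df, factors) {
  for (col in names(factors)) {
    df[[col]] <- df[[col]] * factors[[col]]
  }
  df
}

# Small stand-in for the data in meters.
df_m <- data.frame(X = c(0, 5e-05, 1e-04), Y = c(0.0012, 0.0011, 0.0013))

# X: meters -> millimeters (x 1e3); Y: meters -> micrometers (x 1e6).
df_plot <- convert_units(df_m, c(X = 1e3, Y = 1e6))
```

The converted data frame can then be plotted with aes(x = X, y = Y), with the units stated in the axis titles, for instance through labs().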

Automated way to prevent ggplot hexbin from cutting geoms off axes

This is a slightly different question from an earlier post (ggplot hexbin shows different number of hexagons in plot versus data frame).
I am using hexbin() to bin data into hexagon objects, and ggplot() to plot the results. I notice that, sometimes, the hexagons on the edge of the plot are cut in half. Below is an example.
library(hexbin)
library(ggplot2)
set.seed(1)
data <- data.frame(A=rnorm(100), B=rnorm(100), C=rnorm(100), D=rnorm(100), E=rnorm(100))
maxVal = max(abs(data))
maxRange = c(-1*maxVal, maxVal)
x = data[,c("A")]
y = data[,c("E")]
h <- hexbin(x=x, y=y, xbins=5, shape=1, IDs=TRUE, xbnds=maxRange, ybnds=maxRange)
hexdf <- data.frame(hcell2xy(h), hexID = h@cell, counts = h@count)
ggplot(hexdf, aes(x = x, y = y, fill = counts, hexID = hexID)) +
geom_hex(stat = "identity") +
coord_cartesian(xlim = c(maxRange[1], maxRange[2]), ylim = c(maxRange[1], maxRange[2]))
This creates a graphic where one hexagon is cut off at the top and one hexagon is cut off at the bottom:
Another approach I can try is to hard-code a factor (here 1.5) by which the limits of the x and y axes are multiplied. Doing so does seem to solve the problem, in that no hexagons are cut off anymore.
ggplot(hexdf, aes(x = x, y = y, fill = counts, hexID = hexID)) +
geom_hex(stat = "identity") +
scale_x_continuous(limits = maxRange * 1.5) +
scale_y_continuous(limits = maxRange * 1.5)
However, even though the second approach solves the problem in this instance, the value of 1.5 is arbitrary. I am trying to automate this process for a variety of data and variety of bin sizes and hexagon sizes that could be used. Is there a solution to keeping all hexagons fully visible in the plot without having to hard-code an arbitrary value that may be too large or too small for certain instances?
Consider that you can skip the computation of hexbin, and let ggplot do the job.
Then, if you prefer to manually set the width of the bins you can set the binwidth and modify the limits:
bwd = 1
ggplot(data, aes(x = x, y = y)) +
geom_hex(binwidth = bwd) +
coord_cartesian(xlim = c(min(x) - bwd, max(x) + bwd),
ylim = c(min(y) - bwd, max(y) + bwd),
expand = TRUE) +
geom_point(color = "red") +
theme_bw()
This way, hexagons should never be truncated (though you may end up with some "empty" space).
Result with bwd = 1:
Result with bwd = 3:
If instead you prefer to set the number of bins programmatically, you can use:
nbins_x <- 4
nbins_y <- 6
range_x <- range(data$A, na.rm = TRUE)
range_y <- range(data$E, na.rm = TRUE)
bwd_x <- (range_x[2] - range_x[1])/nbins_x
bwd_y <- (range_y[2] - range_y[1])/nbins_y
ggplot(data, aes(x = A, y = E)) +
geom_hex(bins = c(nbins_x,nbins_y)) +
coord_cartesian(xlim = c(range_x[1] - bwd_x, range_x[2] + bwd_x),
ylim = c(range_y[1] - bwd_y, range_y[2] + bwd_y),
expand = TRUE) +
geom_point(color = "red")+
theme_bw()
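For the precomputed hexbin() route used in the question, a similar padding could be derived from the bin geometry instead of a hard-coded factor. The sketch below is an assumption rather than a documented recipe. It supposes that hexbin lays out hexagons of width diff(xbnds) / xbins and vertex-to-vertex height 2 * diff(ybnds) / (sqrt(3) * xbins * shape), and it pads each side by one full hexagon to stay on the safe side. hex_padding and the stand-in bounds are hypothetical.

```r
# Hypothetical helper: pad by one full hexagon in each direction.
# Assumed geometry (not taken from the thread): hexagon width is
# diff(xbnds) / xbins, and vertex-to-vertex height is
# 2 * diff(ybnds) / (sqrt(3) * xbins * shape).
hex_padding <- function(xbnds, ybnds, xbins, shape = 1) {
  c(x = diff(xbnds) / xbins,
    y = 2 * diff(ybnds) / (sqrt(3) * xbins * shape))
}

# Stand-in bounds; in the question these would be maxRange.
bnds <- c(-3, 3)
pad <- hex_padding(xbnds = bnds, ybnds = bnds, xbins = 5, shape = 1)

# Widen the limits symmetrically by the computed padding.
xlim_padded <- bnds + c(-1, 1) * pad[["x"]]
ylim_padded <- bnds + c(-1, 1) * pad[["y"]]
```

The padded limits would then be passed to coord_cartesian(xlim = xlim_padded, ylim = ylim_padded). Like the binwidth approach above, this trades some empty space for the guarantee that hexagons are not clipped.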

Violin plots with additional points

Suppose I make a violin plot, with say 10 violins, using the following code:
library(ggplot2)
library(reshape2)
df <- melt(data.frame(matrix(rnorm(500),ncol=10)))
p <- ggplot(df, aes(x = variable, y = value)) +
geom_violin()
p
I can add a dot representing the mean of each variable as follows:
p + stat_summary(fun = mean, geom = "point", size = 2, color = "red")
How can I do something similar but for arbitrary points?
For example, if I generate 10 new points, one drawn from each distribution, how could I plot those as dots on the violins?
You can give any function to stat_summary, provided it returns a single value, so one can use the function sample. Put extra arguments, such as size, in fun.args:
p + stat_summary(fun = sample, geom = "point", fun.args = list(size = 1))
Assuming your points are qualified using the same group names (i.e., variable), you should be able to define them manually with:
library(dplyr)
newdf <- group_by(df, variable) %>% sample_n(10)
p + geom_point(data=newdf)
The points can be anything, including static numbers:
newdf <- data.frame(variable = unique(df$variable), value = seq(-2, 2, length.out = 10))
p + geom_point(data=newdf)
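If exactly one point per violin is wanted and dplyr is not available, the extra data frame can also be built in base R. This is a minimal sketch: the names wide, long and one_per_group are introduced here for illustration, and stack() is used in place of melt() to get a comparable long format.

```r
set.seed(1)

# Long-format data comparable to the melt() output in the question:
# 10 groups of 50 values each, built with base R's stack().
wide <- data.frame(matrix(rnorm(500), ncol = 10))
long <- stack(wide)
names(long) <- c("value", "variable")

# Draw one row at random from each group.
one_per_group <- do.call(
  rbind,
  lapply(split(long, long$variable), function(d) d[sample(nrow(d), 1), ])
)
```

Because one_per_group uses the same variable and value column names as the plotted data, it can be passed to geom_point(data = ...) in the same way as newdf.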
I had a similar problem. Code below exemplifies the toy problem - How does one add arbitrary points to a violin plot? - and solution.
## Visualize data set that comes in base R
head(ToothGrowth)
## Make a violin plot with dose variable on x-axis, len variable on y-axis
# Convert dose variable to factor - Important!
ToothGrowth$dose <- as.factor(ToothGrowth$dose)
# Plot
p <- ggplot(ToothGrowth, aes(x=dose, y=len)) +
geom_violin(trim = FALSE) +
geom_boxplot(width=0.1)
# Suppose you want to add 3 blue points
# [0.5, 10], [1,20], [2, 30] to the plot.
# Make a new data frame with these points
# and add them to the plot with geom_point().
TrueVals <- ToothGrowth[1:3,]
TrueVals$len <- c(10,20,30)
# Make dose variable a factor - Important for positioning points correctly!
TrueVals$dose <- as.factor(c(0.5, 1, 2))
# Plot with 3 added blue points
p <- ggplot(ToothGrowth, aes(x=dose, y=len)) +
geom_violin(trim = FALSE) +
geom_boxplot(width=0.1) +
geom_point(data = TrueVals, color = "blue")

Adding geom_point() to geom_hex()

I am creating a plot of hexbin data in R, in which the color represents the number of data points in each hexbin. I seem to have this working as is shown in the MWE below:
library(hexbin)
library(ggplot2)
set.seed(1)
data <- data.frame(A=rnorm(100), B=rnorm(100), C=rnorm(100), D=rnorm(100), E=rnorm(100))
maxVal = max(abs(data))
maxRange = c(-1*maxVal, maxVal)
x = data[,c("A")]
y = data[,c("E")]
h <- hexbin(x=x, y=y, xbins=5, shape=1, IDs=TRUE, xbnds=maxRange, ybnds=maxRange)
hexdf <- data.frame(hcell2xy(h), hexID = h@cell, counts = h@count)
p <- ggplot(hexdf, aes(x=x, y=y, fill = counts, hexID=hexID)) + geom_hex(stat="identity") + coord_cartesian(xlim = c(maxRange[1], maxRange[2]), ylim = c(maxRange[1], maxRange[2]))
I am now trying to superimpose a subset of the original data in the form of points on top of the hexbin plot. I first create the subset of the original data as follows:
dat <- data[c(1:5),]
Then, I tried to plot these five data points onto the hexbin plot, p:
p + geom_point(data = dat, aes(x=A, y=B))
For which I receive the Error: "Error in eval(expr, envir, enclos) : object 'counts' not found". I also tried the following:
p + geom_point() + geom_point(dat, aes(A, B))
For which I receive the Error: "Error: ggplot2 doesn't know how to deal with data of class uneval".
I tried several new ideas based on similar posts - but would always have an error and no resulting plot. I am wondering if such a technique is possible. If anyone has ideas to share, I would very much appreciate your input!
To solve this problem, set inherit.aes = FALSE in your geom_point call. You've mapped the fill aesthetic to counts in your ggplot call, so when ggplot tries to add the points to the plot, it looks for counts in dat. ggplot is telling you "hey, I can't find counts in this data set, so I can't add that geom since it's missing an aes".
p + geom_point(data = dat, aes(x=A, y=B),
inherit.aes = FALSE)
Or, we could define p as:
p <- ggplot() +
geom_hex(data = hexdf, aes(x=x, y=y, fill = counts), stat="identity") +
coord_cartesian(xlim = c(maxRange[1], maxRange[2]), ylim = c(maxRange[1], maxRange[2]))
And then we wouldn't need inherit.aes:
p + geom_point(data = dat, aes(x = A, y = B))
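The same inheritance behaviour can be reproduced without hexbin. In this minimal sketch, which assumes ggplot2 is installed, a tile layer stands in for the hexagons; bins and pts are hypothetical stand-ins for hexdf and dat.

```r
library(ggplot2)

# Hypothetical stand-ins: 'bins' plays the role of hexdf, 'pts' the role of dat.
bins <- data.frame(x = c(1, 2, 3), y = c(1, 2, 3), counts = c(5, 2, 8))
pts  <- data.frame(A = c(1.2, 2.5), B = c(1.1, 2.9))

# fill = counts is set at the plot level, so every layer inherits it by default.
p <- ggplot(bins, aes(x = x, y = y, fill = counts)) +
  geom_tile()

# inherit.aes = FALSE makes this layer use only its own mapping,
# so 'counts' is never looked up in 'pts'.
p_ok <- p + geom_point(data = pts, aes(x = A, y = B), inherit.aes = FALSE)

# Data that the point layer (layer 2) would draw.
pts_drawn <- layer_data(p_ok, 2)
```

Inspecting pts_drawn confirms that the point layer builds from pts alone, without needing a counts column.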

Fill specific regions in geom_violin plot

How can I fill a geom_violin plot in ggplot2 with different colors based on a fixed cutoff?
For instance, given the setup:
library(ggplot2)
set.seed(123)
dat <- data.frame(x = rep(1:3,each = 100),
y = c(rnorm(100,-1),rnorm(100,0),rnorm(100,1)))
dat$f <- with(dat,ifelse(y >= 0,'Above','Below'))
I'd like to take this basic plot:
ggplot() +
geom_violin(data = dat,aes(x = factor(x),y = y))
and simply have each violin colored differently above and below zero. The naive thing to try, mapping the fill aesthetic, splits and dodges the violin plots:
ggplot() +
geom_violin(data = dat,aes(x = factor(x),y = y, fill = f))
which is not what I want. I'd like a single violin plot at each x value, but with the interior filled with different colors above and below zero.
Here's one way to do this.
library(ggplot2)
library(plyr)
#Data setup
set.seed(123)
dat <- data.frame(x = rep(1:3,each = 100),
y = c(rnorm(100,-1),rnorm(100,0),rnorm(100,1)))
First we'll use ggplot2::ggplot_build to capture all the calculated variables that go into plotting the violin plot:
p <- ggplot() +
geom_violin(data = dat,aes(x = factor(x),y = y))
p_build <- ggplot2::ggplot_build(p)$data[[1]]
Next, if we take a look at the source code for geom_violin we see that it does some specific transformations of this computed data frame before handing it off to geom_polygon to draw the actual outlines of the violin regions.
So we'll mimic that process and simply draw the filled polygons manually:
#This comes directly from the source of geom_violin
p_build <- transform(p_build,
xminv = x - violinwidth * (x - xmin),
xmaxv = x + violinwidth * (xmax - x))
p_build <- rbind(plyr::arrange(transform(p_build, x = xminv), y),
plyr::arrange(transform(p_build, x = xmaxv), -y))
I'm omitting a small detail from the source code about duplicating the first row in order to ensure that the polygon is closed.
Now we do two final modifications:
#Add our fill variable
p_build$fill_group <- ifelse(p_build$y >= 0,'Above','Below')
#This is necessary to ensure that instead of trying to draw
# 3 polygons, we're telling ggplot to draw six polygons
p_build$group1 <- with(p_build,interaction(factor(group),factor(fill_group)))
And finally plot:
#Note the use of the group aesthetic here with our computed version,
# group1
p_fill <- ggplot() +
geom_polygon(data = p_build,
aes(x = x,y = y,group = group1,fill = fill_group))
p_fill
Note that in general, this will clobber nice handling of any categorical x axis labels. So you will often need to do the plot using a continuous x axis and then if you need categorical labels, add them manually.
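The role of the combined grouping variable can be seen on a tiny example. This is a base R sketch with a made-up data frame (outline is hypothetical and much smaller than the real p_build): crossing the violin id with the above/below flag yields one group per region to fill.

```r
# Hypothetical stand-in for the built violin data: 3 violins, each with
# outline points above and below zero.
outline <- data.frame(
  group = rep(1:3, each = 4),
  y     = rep(c(-1.5, -0.5, 0.5, 1.5), times = 3)
)
outline$fill_group <- ifelse(outline$y >= 0, "Above", "Below")

# One polygon per (violin, region) pair.
outline$group1 <- interaction(factor(outline$group), factor(outline$fill_group))

# Number of distinct polygons that geom_polygon() would be asked to draw.
n_polygons <- nlevels(outline$group1)
```

With three violins and two regions, group1 has one level per combination, which is why the polygons are grouped by group1 rather than by group.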
