Methods for doing heatmaps, level / contour plots, and hexagonal binning in R

The options for 2D plots of (x, y, z) data in R are numerous, but grappling with them is a challenge, especially when all three variables are continuous.
To clarify the problem (and possibly assist in explaining why I might be getting tripped up with contour or image), here is a possible classification scheme:
Case 1: The value of z is not provided but is a conditional density based on values in (x,y). (Note: this is essentially relegating the calculation of z to a separate function - a density estimation. Something still has to use the output of that calculation, so allowing for arbitrary calculations would be nice.)
Case 2: (x,y) pairs are unique and regularly spaced. This implies that only one value of z is provided per (x,y) value.
Case 3: (x,y) pairs are unique, but are continuous. Coloring or shading is still determined by only 1 unique z value.
Case 4: (x,y) pairs are not unique, but are regularly spaced. Coloring or shading is determined by an aggregation function on the z values.
Case 5: (x,y) pairs are not unique and are continuous. Coloring / shading must be determined by an aggregation function on the z values.
If I am missing some cases, please let me know. The case that interests me is #5. Some notes on relationships:
Case #1 seems to be well supported already.
Case #2 is easily supported by heatmap, image, and functions in ggplot.
Case #3 is supported by base plot, though use of a color gradient is left to the user.
Case #4 can become case #2 by use of a split & apply functionality. I have done that before.
Case #5 can be converted to #4 (and then #2) by using cut, but this is inelegant and boxy. Hex binning may be better, though that does not seem to be easily conditioned on whether there is a steep gradient in the value of z. I'd settle for hex binning, but alternative aggregation functions are quite welcome, especially if they can utilize the z values.
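As a concrete illustration of the #5 -> #4 -> #2 reduction, here is a minimal sketch in base R (it uses the mat1 and z generated in the saddle code below; the bin count is arbitrary):
binned <- data.frame(xbin = cut(mat1[, 1], breaks = 20),  # bin continuous x
                     ybin = cut(mat1[, 2], breaks = 20),  # bin continuous y
                     z    = z)
# Any aggregation function works here: mean, median, var, ...
gridded <- aggregate(z ~ xbin + ybin, data = binned, FUN = mean)
# gridded now has one z per (xbin, ybin) cell, i.e. case #2, and can be
# drawn with image(), levelplot(), or geom_tile().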
How can I do #5? Here is code to produce a saddle; the spread parameter changes the dispersion of z around the saddle, which should create differences in the plotted gradients.
N = 1000
spread = 0.6 # try 0.6 and 3.0 to vary the dispersion of z
set.seed(0)
rot = matrix(rnorm(4), ncol = 2)      # random linear transformation
mat0 = matrix(rnorm(2 * N), ncol = 2) # raw (x, y) points
mat1 = mat0 %*% rot                   # transformed points to plot
zMean = mat0[,2]^2 - mat0[,1]^2       # saddle surface
z = rnorm(N, mean = zMean, sd = spread * median(abs(zMean))) # noisy z
I'd like to do something like hexbin, but I've banged on this with ggplot and haven't made much progress. If I can apply an arbitrary aggregation function to the z values in a region, that would be even better. (The form of such a function might be like plot(mat1, colorGradient = f(z), aggregation = "bin", bins = 50).)
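(Note for later readers: newer versions of ggplot2 gained stat_summary_hex, which does essentially this - it applies an arbitrary summary function to the z values within hexagonal bins, and requires the hexbin package to be installed. A minimal sketch:)
library(ggplot2)
df <- data.frame(x = mat1[, 1], y = mat1[, 2], z = z)
# fun can be mean, median, var, or any other summary of z within a hex cell
ggplot(df, aes(x, y, z = z)) + stat_summary_hex(fun = mean, bins = 50)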
How can I do this in ggplot or another package? I am happy to make this question a community wiki question (or other users can, by editing it enough times). If so, one answer per post, please, so that we can focus on, say, ggplot, levelplot, lattice, contourplot (or image), and other options, if they exist.
Update 1: The volcano example is a good example of case #3: the data is regularly spaced (it could be lat/long), with one z value per observation. A topographic map has (latitude, longitude, altitude), and thus one value per location. Suppose instead one is obtaining weather readings (e.g. rainfall, wind speed, sunlight) over many days from many randomly placed sensors: that is more akin to #5 than to #3 - we may have lat & long, but the z values can range quite a bit, even for the same or nearby (x,y) values.
Update 2: The answers so far, by DWin, Kohske, and John Colby, are all excellent. My actual data set is a small sample of a larger set, but even at 200K points it produces interesting results. On the (x,y) plane it has very high density in some regions (so overplotting would occur there) and much lower density or complete absence in others. With John's suggestion via fields, I needed to subsample the data for Tps to work (I'll investigate whether I can avoid subsampling), but the results are quite interesting. With rms/Hmisc (DWin's suggestion), the full 200K points work out well. Kohske's suggestion is also quite good and, since the data is transformed into a grid before plotting, there's no issue with the number of input data points; it also gives me greater flexibility in deciding how to aggregate the z values within a region. I am not yet sure whether I will use mean, median, or some other aggregation.
I also intend to try out Kohske's nice example of mutate + ddply with the other methods - it is a good example of how to get different statistics calculated over a given region.
Update 3: The different methods are distinct and several are remarkable, though there isn't a clear winner. I have accepted John Colby's answer; I think I will use either it or DWin's method in further work.

I've had great luck with the fields package for this type of problem. Here is an example using Tps for thin plate splines:
EDIT: combined plots and added standard error
require(fields)
dev.new(width=6, height=6)
set.panel(2,2)
# Plot x,y
plot(mat1)
# Model z = f(x,y) with splines
fit = Tps(mat1, z)
pred = predict.surface(fit)
# Plot fit
image(pred)
surface(pred)
# Plot standard error of fit
xg = make.surface.grid(list(pred$x, pred$y))
pred.se = predict.se(fit, xg)
surface(as.surface(xg, pred.se))
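(Note for later readers: in more recent versions of the fields package, predict.surface and predict.se have been deprecated in favor of predictSurface and predictSE; if the calls above error out, those are the replacements to try.)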

I generally use the rms/Hmisc package combination. This is a linear regression analysis (function ols) using crossed cubic spline terms whose plotted output closely resembles the fields example offered:
dfrm <- data.frame(z=z, xcor = mat1[,1], ycor=mat1[,2])
require(rms) # will automatically load Hmisc which needs to have been installed
lininterp <- ols(z ~ rcs(xcor,3)*rcs(ycor,3), data=dfrm)
ddI <- datadist(dfrm)
options(datadist="ddI")
bplot(Predict(lininterp, xcor, ycor)) # Plot not shown
perim <- with(dfrm, perimeter(xcor, ycor))
bplot(Predict(lininterp, xcor, ycor), perim=perim)
# Plot attached after converting to .png
You can also see a method that does not rely on regression estimates of the 3-D surface in the second part of my answer to this question: Using color as the 3rd dimension
The plotting paradigm is lattice, and you can also get contour plots as well as this pretty levelplot. If you want the predicted values at an interior point, you can get them with the Predict function applied to the fit object.
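For instance (a sketch based on the rms documentation, where bplot's lfun argument selects which lattice function is used, so a contour plot comes from the same fit):
bplot(Predict(lininterp, xcor, ycor), perim = perim, lfun = lattice::contourplot)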

There is a panel.2dsmoother function in the latticeExtra package:
library(lattice)
library(latticeExtra)
df <- data.frame(mat1, z)
names(df)[c(1,2)] <- c('x', 'y')
levelplot(z ~ x * y, data = df, panel = panel.2dsmoother, contour=TRUE)
According to its help page, "the smoothing model is constructed (approximately) as method(form, data = list(x=x, y=y, z=z), {args}) [...] This should work with any model function that takes a formula argument, and has a predict method".
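For instance, a hedged sketch of swapping in a different model (assuming method and form are passed through to the panel function, as that help page describes):
# A quadratic linear-model surface instead of the default loess smooth
levelplot(z ~ x * y, data = df, panel = panel.2dsmoother,
          method = "lm", form = z ~ poly(x, 2) + poly(y, 2), contour = TRUE)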

Probably the question can be divided into two parts. The first is aggregating the data, and the second is visualizing it.
The fields package, as @John shows, can do both at once.
In ggplot2, if the aggregation is simply a count of the data points, stat_bin2d is available.
Anyway, if you want to use your own aggregation function, maybe something like this will help:
library(plyr) # for ddply() and mutate()
df <- data.frame(x = mat1[,1], y = mat1[,2], z = z)
Nx <- 10 # number of bins for x
Ny <- 4  # number of bins for y
# create a data frame of per-bin summaries, extracting the numeric bin
# edges from the interval labels produced by cut()
df2 <- mutate(ddply(df, .(x = cut(x, Nx), y = cut(y, Ny)), summarise,
                    Mean = mean(z),
                    Var = var(z)),
              xmin = as.numeric(sub("\\((.+),.*", "\\1", x)),
              xmax = as.numeric(sub("[^,]*,([^]]*)\\]", "\\1", x)),
              ymin = as.numeric(sub("\\((.+),.*", "\\1", y)),
              ymax = as.numeric(sub("[^,]*,([^]]*)\\]", "\\1", y)),
              xint = as.numeric(x),
              yint = as.numeric(y))
# then, visualize
library(ggplot2)
ggplot(df2, aes(xint, yint, xmin = xmin, ymin = ymin, xmax = xmax, ymax = ymax, fill = Mean)) +
  geom_tile(stat = "identity")
ggplot(df2, aes(xint, yint, xmin = xmin, ymin = ymin, xmax = xmax, ymax = ymax, fill = Var)) +
  geom_tile(stat = "identity")
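(For readers without plyr: a sketch of the same binned aggregation in current dplyr idiom; the bin-edge extraction above is still needed for the tile coordinates:)
library(dplyr)
df2 <- df %>%
  group_by(x = cut(x, Nx), y = cut(y, Ny)) %>%
  summarise(Mean = mean(z), Var = var(z), .groups = "drop")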

Related

How to fix unstable y-positions for geom_jitter() for ggplot2 in R?

I'm making a common R ggplot2 graph: boxplots supplemented with the individual samples shown as points by geom_jitter(), so that the positions and number of samples in each group are visible. Normally I have not noticed a problem, but with some recent data I've noticed substantial inaccuracy and variation in the y-position of the jittered points. The boxplot stays stable with respect to y, and so does geom_point() when used to show the same points that jitter is plotting. The error is probably not noticeable when you have many data points, but with 5-10 samples in a group it can produce an obvious shift that may mislead you if you are not aware of the issue.
I first thought this may have always happened without my noticing, so I generated some random numbers and made a ggplot with geom_jitter(), but at first the problem did not appear. Example data and plots for the normal and problematic cases are given below.
Data generation and plotting that worked as expected:
df <- data.frame("X" = rep("X", 5), "Y" = rnorm(5, 100, 30))
check the plot:
library(ggplot2)
ggplot(df, aes(X, Y)) + geom_boxplot() + geom_jitter(col = "red") + geom_point(col = "blue")
The red and blue dots are almost exactly aligned, and if you rerun the code several times and watch the plot in the RStudio preview, you will see no variation in the y-position of the jitter points (only horizontal variation along the x-axis, as expected). In a problematic case like the one below, you quickly see y-axis variation, especially because it sometimes shifts the range of the y-axis.
With more variation in the random numbers, I found a visible difference between the red and blue points, which varied each time the same data was plotted:
df <- data.frame("X" = rep("X", 5), "Y" = rnorm(5, 100, 400))
The actual numbers to get this problem were:
X Y
1 X 610.78026
2 X -38.58905
3 X -196.00943
4 X 94.37797
5 X 415.58417
In my result, the lowest point, -196, was sometimes drawn at about -170 and sometimes at about -250, and the range of the y-axis shifted each time. This is similar to the problem I had with real data.
Further testing with data having more variance, or a larger range between points, did not explain the variability of the jitter y-position; in some cases with more variance, geom_jitter() again produced near-perfect y-positions. So I wondered whether it had something to do with how ggplot2 maps certain plot areas. I tried to test that by forcing ggplot to keep the same y-limits with ylim(-206, 621), but that failed to fix the problematic case above. It gives a mysterious yet consistent error: "Warning message: Removed 1 rows containing missing values (geom_point)."
(In the corresponding plot, the red jitter point for the 610.7 value was lost, despite enough pixel space in the preview window for about 10 more points between the blue point and the top of the graph. In another attempt, 2 jitter points were lost, because the jitter sometimes goes past the lower limit.)
A roundabout solution would be to generate random x-positions myself while keeping the same Y and group identity, but that is not efficient. When non-numerical groups are used on x, I found they take the numerical position 1 (and so on for further labels). Adding the following to the last dataframe gives the proper appearance: + geom_point(aes(x = rnorm(5, 1, .2), y = Y), col = "yellow") - but that would become quite cumbersome with many groups unless there is some way to automatically get the correct x-position for each group of boxplots.
Any input on the cause of this problem would be a great help.
It sounds like you do not want the default geom_jitter behavior, which adds a uniformly distributed amount of noise separately to the x and y value before plotting, by default "40% of the resolution of the data: this means the jitter values will occupy 80% of the implied bins."
For a continuous variable like yours "resolution" is "the smallest non-zero distance between adjacent values."
Try this:
geom_jitter(col = "red", height = 0) +
That will tell ggplot you want no noise applied to the y values before plotting.
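In the context of the question's plot, the full call would be (using the asker's df):
ggplot(df, aes(X, Y)) +
  geom_boxplot() +
  geom_jitter(col = "red", height = 0) + # jitter along x only
  geom_point(col = "blue")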
Another approach would be to add noise yourself before the plotting step, giving you the ability to control its distribution and range specifically.
e.g. instead of having the jittering fill a uniform rectangle: ...
library(dplyr)
library(ggplot2)
tibble(x = rep(1:2, each = 1000),
       y = rep(3:4, each = 1000)) -> point_data
ggplot(point_data, aes(x, y)) + geom_jitter()
We could add whatever noise function we want. Here, for no particular reason, I make donuts around the real data, and compare that to the default jitter:
point_data %>%
  mutate(angle = runif(2000, 0, 2*pi),
         dist = rnorm(2000, 0.3, 0.05),
         x2 = x + dist*cos(angle),
         y2 = y + dist*sin(angle)) %>%
  ggplot() +
  geom_jitter(aes(x, y), color = "red", alpha = 0.2) +
  geom_point(aes(x2, y2))

How to color points and provide a figure key in a QQ plot using qqplotr::stat_qq_point?

I am using qqplotr::stat_qq_point() which is an "add-on" to ggplot2 to display a quantile-quantile plot. I would like to color the points by a grouping factor and also provide a figure key. I would also like to include a 95% CI band and fit line. One coherent way to do these tasks is to use ggplot2, stat_qq_band (CI from qqplotr), stat_qq_line (best fit line in both ggplot2 and qqplotr) and stat_qq_point (plotted qq points from qqplotr). However, I cannot work out how to present the figure key.
In my code (below) I have omitted the 95% CI and fit line, as they can easily be added afterwards. The code produces correctly colored points, but no legend.
I know that my code is very kludgy. If I understand correctly, aes in stat_qq_point only accepts the sample parameter and does not accept colour. This means the normal strategies for providing colour and figure legends for plotted data points are not available.
I found a very similar question here
However, the previous question and answer is a little bit of a "hack". The strategy proposed in the previous question is to not use stat_qq_point but rather to calculate the quantiles separately using the base function qqnorm. The ggplot function geom_point can then be used with its attendant abilities to color points individually and provide a figure key.
sample data from the other question:
library(dplyr) # for data_frame()
library(broom) # for augment()
set.seed(1001)
N <- 1000
G <- 10
dd <- data_frame(x = runif(N),
                 f = factor(sample(1:G, size = N, replace = TRUE)),
                 y = rnorm(N) + 2*x + as.numeric(f))
m1 <- lm(y ~ x, data = dd)
dda <- cbind(augment(m1), f = dd$f)
Using these data, my approach is as follows:
gg_color_hue <- function(n) {
  hues = seq(15, 375, length = n + 1)
  hcl(h = hues, l = 65, c = 100)[1:n]
}
n = length(unique(dda$f))
colores_1 = gg_color_hue(n)
dda$Color <- colores_1[dda$f]
dda$theory_quant = qqnorm(dda$.resid, plot.it = FALSE)$x
dda$sample_quant = qqnorm(dda$.resid, plot.it = FALSE)$y
library(qqplotr)
ggplot() +
  stat_qq_point(
    data = dda,
    mapping = aes(sample = .resid),
    colour = dda$Color
  ) +
  scale_colour_manual(
    values = unique(dda$Color),
    name = "f",
    labels = c(1:10)
  ) +
  guides(
    colour = guide_legend(override.aes = list(fill = NA), ncol = 2, byrow = TRUE)
  ) +
  labs(x = "Theoretical Quantiles", y = "Sample Quantiles")
This yields colored points but no figure key (plot not shown).
So, in sum, I would like an approach that draws QQ plots with points colored by grouping, a best-fit line, a 95% CI band, and a figure key, using ggplot2 and its "add-on" qqplotr. But perhaps it is simply not possible to do all of these tasks using qqplotr.
I thought it would be of value to leave my question on SO, in case better approaches have emerged in the 1.8 years since the previous question was asked.
Many thanks for your help!
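For reference, a minimal sketch of the qqnorm-plus-geom_point fallback described above, using the theory_quant and sample_quant columns computed earlier; mapping colour inside aes() is what makes ggplot draw the key:
library(ggplot2)
ggplot(dda, aes(theory_quant, sample_quant, colour = f)) +
  geom_point() +
  labs(x = "Theoretical Quantiles", y = "Sample Quantiles")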

How to plot density of points in one dimension with different factors in ggplot2

I am attempting to place individual points on a plot using ggplot2, but as there are many points, it is difficult to gauge how densely packed they are. Here, two factors are being compared against a continuous variable, and I want to change the color of the points to reflect how closely packed they are with their neighbors. I am using the geom_point function in ggplot2 to plot the points, but I don't know how to feed it the right color information.
Here is the code I am using:
s1 = rnorm(1000, 1, 10)
s2 = rnorm(1000, 1, 10)
# note: the replicate counts must match the 1000 samples per group
data = data.frame(task_number = as.factor(c(replicate(1000, 1),
                                            replicate(1000, 2))),
                  S = c(s1, s2))
ggplot(data, aes(x = task_number, y = S)) + geom_point()
This generates a plain scatter of points (plot not shown). However, I want it to look more like this density-colored image, but with one dimension rather than two (borrowed from https://slowkow.com/notes/ggplot2-color-by-density/):
How do I change the colors of the first plot so it resembles that of the second plot?
I think the tricky thing about this is you want to show the original values, and evaluate the density at those values. I borrowed ideas from here to achieve that.
library(dplyr)
library(ggplot2)
data = data %>%
  group_by(task_number) %>%
  # Use approxfun to interpolate the density back to
  # the original points
  mutate(dens = approxfun(density(S))(S))
ggplot(data, aes(x = task_number, y = S, colour = dens)) +
  geom_point() +
  scale_colour_viridis_c()
Result: points colored by their local density (plot not shown).
One could, of course, come up with a measure of proximity to neighbouring values for each value... However, wouldn't adjusting the transparency basically achieve the same goal (gauging how densely packed the points are)?
geom_point(alpha = 0.03)

How to do a 2d heatmap with color smoothing ... or a density plot from absolute values?

I've done the rounds here and via google without a solution, so please help if you can.
I'm looking to create something like this pain-sensitivity heat map using ggplot2 (image not shown).
I can create something kinda similar using geom_tile, but without the smoothing between data points. The only solution I have found requires a lot of code and data interpolation (my ugly tile-based attempt is also not shown). Not very elegant, methinks.
So I'm thinking, I could coerce the density2d plots to my purposes instead by having the plot use fixed values rather than a calculated data-point density -- much in the same way that stat='identity' can be used in histograms to make them represent data values, rather than data counts.
So a minimal working example:
library(ggplot2)
df <- expand.grid(letters[1:5], LETTERS[1:5])
df$value <- sample(1:4, 25, replace=TRUE)
# A not so pretty, non-smooth tile plot
ggplot(df, aes(x=Var1, y=Var2, fill=value)) + geom_tile()
# A potentially beautiful density2d plot, except it fails :-(
ggplot(df, aes(x=Var1, y=Var2)) + geom_density2d(aes(color=..value..))
This took me a little while, but here is a solution for future reference:
A solution using idw from the gstat package and spsample from the sp package.
I've written a function which takes a dataframe, a number of blocks (tiles), and lower and upper anchors for the colour scale.
The function creates a polygon (a simple quadrant of 5x5) and from that creates a grid of that shape.
In my data, the location variables are ordered factors - therefore I unclass them into numbers (1 to 5, corresponding to the polygon grid) and convert them to coordinates, thus converting tmpDF from a data frame to a spatial data frame. Note: there are no overlapping/duplicate locations - i.e. 25 observations corresponding to the 5x5 grid.
The idw function fills in the polygon-grid (newdata) with inverse-distance weighted values ... in other words, it interpolates my data to the full polygon grid of a given number of tiles ('blocks').
Finally, I create a ggplot using a color gradient from the colorRamps package:
library(sp)         # Polygon, spsample, coordinates
library(gstat)      # idw
library(dplyr)      # mutate, filter
library(ggplot2)
library(colorRamps) # matlab.like2

painMapLumbar <- function(tmpDF, blocks=2500, lowLimit=min(tmpDF$value), highLimit=max(tmpDF$value)) {
  # Create polygon to represent the lower back (lumbar)
  poly <- Polygon(matrix(c(0.5,0.5, 0.5,5.5, 5.5,5.5, 5.5,0.5, 0.5,0.5), ncol=2, byrow=TRUE))
  # Create a grid of datapoints from the polygon
  polyGrid <- spsample(poly, n=blocks, type="regular")
  # Filter out the data for the figure we want
  tmpDF <- tmpDF %>% mutate(x=unclass(x)) %>% mutate(y=unclass(y))
  tmpDF <- tmpDF %>% filter(y < 6) # Lumbar region only
  coordinates(tmpDF) <- ~x+y
  # Interpolate the data as Inverse Distance Weighted
  invDistanceWeighted <- as.data.frame(idw(formula = value ~ 1, locations = tmpDF, newdata = polyGrid))
  p <- ggplot(invDistanceWeighted, aes(x=x1, y=x2, fill=var1.pred)) +
    geom_tile() +
    scale_fill_gradientn(colours=matlab.like2(100), limits=c(lowLimit, highLimit))
  return(p)
}
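A hypothetical call, for illustration only: it assumes a data frame whose x and y columns are ordered factors over the 5x5 locations and whose value column is numeric, built here from the question's minimal example:
df <- expand.grid(x = 1:5, y = 1:5)
df$x <- factor(df$x, ordered = TRUE)
df$y <- factor(df$y, ordered = TRUE)
df$value <- sample(1:4, 25, replace = TRUE)
painMapLumbar(df, blocks = 2500)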
I hope this is useful to someone ... thanks for the replies above ... they helped me move on.

Controlling alpha in ggparcoord (from GGally package)

I am trying to build on a question similar to mine (from which I borrowed the self-contained example and title inspiration). I am trying to apply transparency individually to each line of a ggparcoord plot, or somehow layer two ggparcoord plots on top of each other. A detailed description of the problem, and the data format the solution should work with, is provided below.
I have a dataset with thousands of lines; let's call it x.
library(GGally)
x = data.frame(a=runif(100,0,1),b=runif(100,0,1),c=runif(100,0,1),d=runif(100,0,1))
After clustering this data I also get a set of 5 lines, let's call this dataset y.
y = data.frame(a=runif(5,0,1),b=runif(5,0,1),c=runif(5,0,1),d=runif(5,0,1))
In order to see the centroids y overlaid on x, I use the following code. First I append y to x so that the 5 centroid rows are at the bottom of the final dataframe. This ensures ggparcoord will draw them last, so they stay on top of all the data:
df <- rbind(x,y)
Next I create a new column for df, following the advice in the question I referred to, so that I can color the centroids differently and tell them apart from the data:
df$cluster = "data"
df$cluster[(nrow(df)-4):(nrow(df))] <- "centroids"
Finally I plot it:
p <- ggparcoord(df, columns=1:4, groupColumn=5, scale="globalminmax", alphaLines = 0.99) + xlab("Sample") + ylab("log(Count)")
p + scale_colour_manual(values = c("data" = "grey","centroids" = "#94003C"))
The problem I am stuck with starts at this stage. On my original data, plotting x alone doesn't lead to much insight, since it is a heavy mass of lines (this is equivalent to running ggparcoord above on x instead of df; plot not shown).
By reducing alphaLines considerably (to 0.05), I can naturally see some clusters thanks to the overlapping of the lines (again running ggparcoord on x, with reduced alphaLines).
It makes more sense to overlay the centroids added to df on the second plot, not the first.
However, since everything is in a single dataframe, applying such a low alphaLines value makes the centroid lines disappear as well. My only option is then to use ggparcoord (as provided above) on df without decreasing the alphaLines value.
My goal is to have the red (centroid) lines drawn on top of the second figure, whose data lines have very low alpha. There are two approaches I have thought of so far, but couldn't get working:
(1) Is there any way to create a column in the dataframe, similar to what is done for the color, so that I can specify the alpha value of each line?
(2) I originally attempted to create two different ggparcoords and "sum them up", hoping they would overlay, but an error was raised.
The question may contain too much detail, but I thought this could motivate better the applicability of the answer to serve the interest of other readers.
The answer I am looking for would use the provided data variables on the current format and generate the plot I am looking for. Better ways to reconstruct the data is also welcomed, but using the current structure is preferred.
In this case I think it is easier to just use ggplot and build the graph yourself. We make a slight adjustment to how the data is represented (we put it in long format), and then we make the parallel coordinates plot. We can then map any attribute you like to cluster.
library(dplyr)
library(tidyr)
library(ggplot2)
# I start the same as you
x <- data.frame(a=runif(100,0,1), b=runif(100,0,1), c=runif(100,0,1), d=runif(100,0,1))
y <- data.frame(a=runif(5,0,1), b=runif(5,0,1), c=runif(5,0,1), d=runif(5,0,1))
# I find this an easier way to combine the two data.frames, and have an id column
df <- bind_rows(data = x, centroids = y, .id = 'cluster')
# We need to add id's, so we know which points to connect with a line
df$id <- 1:nrow(df)
# Put the data into long format
df2 <- gather(df, 'column', 'value', a:d)
# And plot:
ggplot(df2, aes(column, value, alpha = cluster, color = cluster, group = id)) +
  geom_line() +
  scale_colour_manual(values = c("data" = "grey", "centroids" = "#94003C")) +
  scale_alpha_manual(values = c("data" = 0.2, "centroids" = 1)) +
  theme_minimal()
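(Side note: in current tidyr, gather() is superseded; the equivalent reshaping step would be df2 <- pivot_longer(df, a:d, names_to = 'column', values_to = 'value').)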
