Say I am using facet_grid() in ggplot2 to obtain 2 histograms. Now I want to superimpose Poisson curves on these histograms (with different means for the 2 facets) and a second curve of another distribution (for which I want to provide the probability values manually). How can this be done?
Constructing an example:
library(ggplot2)
value<-c(rpois(500,1.5))
group<-rep(c("A","B"),250)
data<-data.frame(value,group)
g1<-ggplot(data,aes(value))
g1+geom_histogram(aes(y=..count..),binwidth=1,position="identity")+facet_grid(.~group)
What next?
Alternatively, can it be done using the lattice package?
The easy way is to plot densities instead of counts and use stat_function()
library(ggplot2)
value<-c(rpois(500,1.5))
group<-rep(c("A","B"),250)
data<-data.frame(value,group)
ggplot(data, aes(value)) +
  geom_histogram(aes(y = ..density..), binwidth = 1, position = "identity") +
  facet_grid(. ~ group) +
  stat_function(geom = "line", fun = dpois, args = list(lambda = 1.5), colour = "red", n = 9)
If you want counts, then you need to scale the dpois densities up to counts; with a binwidth of 1 and 250 observations per group, that just means multiplying by the group total:
ggplot(data, aes(value)) +
  geom_histogram(aes(y = ..count..), binwidth = 1, position = "identity") +
  facet_grid(. ~ group) +
  stat_function(geom = "line", fun = function(..., total) dpois(...) * total,
                args = list(lambda = 1.5, total = 250), colour = "red", n = 9)
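Note that stat_function() draws the same curve in every facet. If you need a different Poisson mean per facet, or a second curve whose probability values you supply by hand, one way (a sketch on my part, not tested against your data; the per-group means 1.2 and 1.8 are made up) is to precompute the curve values per group and overlay them with geom_line():
# per-facet curves via a precomputed data frame; the lambdas are assumed values
lambdas <- c(A = 1.2, B = 1.8)
curves <- do.call(rbind, lapply(names(lambdas), function(g) {
  data.frame(group = g,
             value = 0:8,
             count = dpois(0:8, lambdas[[g]]) * 250)  # 250 observations per group, binwidth 1
}))
ggplot(data, aes(value)) +
  geom_histogram(aes(y = ..count..), binwidth = 1, position = "identity") +
  facet_grid(. ~ group) +
  geom_line(data = curves, aes(value, count), colour = "red")
The same data-frame trick works for any manually specified distribution: replace the dpois() call with your own vector of probabilities times the group size.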
When recently faced with a similar problem (comparing distributions), I wrote up some code for transparent overlapping histograms that might give you some ideas on where to start.
I have a boxplot which summarizes ~60000 turbidity data points into quartiles, median, whiskers and sometimes outliers. Often a few outliers are so high up that the whole plot is compressed at the bottom, and I therefore choose to omit the outliers. However, I have also added averages to the plots as points, and I want these to be plotted always. The problem is that the y-axis of the boxplot does not adjust to the added average points, so when averages are far above the box they are simply plotted outside the chart window (see the X-point for 2020, but none for 2021 or 2022). Normally with this parameter, the average will be between the whisker end and the most extreme outliers. This is normal, and expected in the data.
I have tried to capture the boxplot y-axis range to compare with the average, and then setting the ylim if needed, but I just don't know how to retrieve these axis ranges.
My code is just
boxplot(...)
points(...)
and it works as far as plotting the points goes; it just doesn't adjust the y-axis.
Question 1: is it not possible to get the boxplot to redraw with the new points data? I thought this was standard in R plots.
Question 2: if not, how can I dynamically adjust the y-axis range?
Let's try to show a concrete example of the problem with some simulated data:
set.seed(1)
df <- data.frame(y = c(rexp(99), 150), x = rep(c("A", "B"), each = 50))
Here, group "B" has a single outlier at 150, even though most values are a couple of orders of magnitude lower. That means that if we try to draw a boxplot, the boxes get squished at the bottom of the plot:
boxplot(y ~ x, data = df, col = "lightblue")
If we remove outliers, the boxes plot nicely:
boxplot(y ~ x, data = df, col = "lightblue", outline = FALSE)
The problem comes when we want to add a point indicating the mean value for each boxplot, since the mean of "B" lies outside the plot limits. Let's calculate and plot the means:
mean_vals <- sapply(split(df$y, df$x), mean)
mean_vals
#> A B
#> 0.9840417 4.0703334
boxplot(y ~ x, data = df, col = "lightblue", outline = FALSE)
points(1:2, mean_vals, cex = 2, pch = 16, col = "red")
The mean for "B" is missing because it lies above the upper range of the plot.
The secret here is to use boxplot.stats to get the limits of the whiskers. By concatenating our vector of means to this vector of stats and getting its range, we can set our plot limits exactly where they need to be:
y_limits <- range(c(boxplot.stats(df$y)$stats, mean_vals))
Now we apply these limits to a new boxplot and draw it with the points:
boxplot(y ~ x, data = df, outline = FALSE, ylim = y_limits, col = "lightblue")
points(1:2, mean_vals, cex = 2, pch = 16, col = "red")
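One caveat: boxplot.stats(df$y) pools both groups when computing the whiskers. If your groups have very different spreads, a variant (my addition, not part of the answer above) is to take the whisker stats per group before computing the range:
# per-group whisker stats instead of pooling all of df$y
per_group_stats <- sapply(split(df$y, df$x), function(v) boxplot.stats(v)$stats)
y_limits <- range(c(per_group_stats, mean_vals))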
For comparison, you could do the whole thing in ggplot like this:
library(ggplot2)
ggplot(df, aes(x, y)) +
  geom_boxplot(fill = "lightblue", outlier.shape = NA) +
  geom_point(size = 3, color = "red", stat = "summary", fun = mean) +
  coord_cartesian(ylim = range(c(boxplot.stats(df$y)$stats, mean_vals))) +
  theme_classic(base_size = 16)
Created on 2023-02-05 with reprex v2.0.2
I'd like to use something like ggplot2 and ggmap to produce a heat map of arbitrary values such as property prices per metre squared over a geographic area at a street level (with a high resolution).
Unfortunately, the task appears to be rather difficult because while ggplot2 can produce a great density plot, it seems unable to visualise spatial data like this without prior interpolation.
For this, I've used the libraries akima (gridded bivariate interpolation for irregular data) and mgcv (generalised additive models with integrated smoothness estimation); however, my knowledge of interpolation methods is mediocre at best and the results I've been able to produce aren't satisfactory.
Consider the following example:
Data
library(ggplot2)
library(ggmap)
library(tibble)
## data simulation
set.seed(1945)
df <- tibble(x = rnorm(500, -0.7406, 0.03),
             y = rnorm(500, 51.9976, 0.03),
             z = abs(rnorm(500, 2000, 1000)))
Map, scatterplot, density plot
## ggmap
map <- get_map("Bletchley Park, Bletchley, Milton Keynes", zoom = 13, source = "stamen", maptype = "toner-background")
q <- ggmap(map, extent = "device", darken = .5)
## scatterplot over map
q + geom_point(aes(x, y, colour = z), data = df)
## classic density heat map
q +
stat_density2d(aes(x=x, y=y, fill=..level..), data=df, geom="polygon", alpha = .2) +
geom_density_2d(aes(x=x, y=y), data=df, colour = "white", alpha = .4) +
scale_fill_distiller(palette = "Spectral")
As you can see, the data are rather dense over the chosen area and the density heat map looks great with round edges and closed curves (except for some of the outermost layers).
Interpolation and plotting using akima
## akima interpolation
library(akima)
df_akima <-interp2xyz(interp(x=df$x, y=df$y, z=df$z, duplicate="mean", linear = T,
xo=seq(min(df$x), max(df$x), length=200),
yo=seq(min(df$y), max(df$y), length=200)), data.frame=TRUE)
## akima plot
q +
geom_tile(aes(x = x, y = y, fill = z), data = df_akima, alpha = .4) +
stat_contour(aes(x = x, y = y, z = z, fill = ..level..), data = df_akima, geom = 'polygon', alpha = .4) +
geom_contour(aes(x = x, y = y, z = z), data = df_akima, colour = 'white', alpha = .4) +
scale_fill_distiller(palette = "Spectral", na.value = NA)
This produces a dense grid of interpolated values (to ensure a sufficient resolution) and while the tile plot underneath is acceptable, the contour plots are too ragged and many of the curves aren't closed.
Non-linear interpolation using linear = F is smoother, but apparently sacrifices resolution and goes wild with the numbers (negative values of z).
Interpolation and plotting using mgcv
## mgcv interpolation
library(mgcv)
gam <- gam(z ~ s(x, y, bs = 'sos'), data = df)
df_mgcv <- data.frame(expand.grid(x = seq(min(df$x), max(df$x), length=200),
y = seq(min(df$y), max(df$y), length=200)))
resp <- predict(gam, df_mgcv, type = "response")
df_mgcv$z <- resp
## mgcv plot
q +
geom_tile(aes(x = x, y = y, fill = z), data = df_mgcv, alpha = .4) +
stat_contour(aes(x = x, y = y, z = z, fill = ..level..), data = df_mgcv, geom = 'polygon', alpha = .4) +
geom_contour(aes(x = x, y = y, z = z), data = df_mgcv, colour = 'white', alpha = .4) +
scale_fill_distiller(palette = "Spectral", na.value = NA)
The same process using mgcv results in a nice and smooth plot, but the resolution is much lower and practically none of the curves are closed.
Questions
Could you please suggest a better method or modify my attempt to obtain a plot similar to the first one (clean, connected, and smooth lines with high resolution)?
Is it possible to close the curves, e.g. in the last plot (the shaded area should be computed beyond the image boundaries)?
Thank you for your time!
The problem with your maps is not the interpolation method you're using, but the way ggplot displays density lines. Here's an answer to this: Remove gaps in a stat_density2d ggplot chart without modifying XY limits.
The density lines go beyond the map, so any polygon that goes outside the plot area is rendered inappropriately (ggplot will close the polygon using the next point of the corresponding level). This does not show up much on your first map because the interpolation resolution is low.
The trick proposed by Andrew is to first expand the plot area so that the density lines are rendered correctly, and then cut off the display area to hide the extra space. I tested his solution with your first example; here's the code:
q +
stat_density2d(
aes(x = x, y = y, fill = ..level..),
data = df,
geom = "polygon",
alpha = .2,
color = "white",
bins = 20
) +
scale_fill_distiller(
palette = "Spectral"
) +
xlim(
min(df$x) - 10^-5,
max(df$x) + 10^-5
) +
ylim(
min(df$y) - 10^-3,
max(df$y) + 10^-3
) +
coord_equal(
expand = FALSE,
xlim = c(-.778, -.688),
ylim = c(51.965, 52.03)
)
The only differences are that I used offsets from min() and max() instead of fixed numbers, and coord_equal() to ensure the map isn't distorted. In addition, I manually specified a greater number of levels (using bins), since by increasing the plot area, stat_density2d automatically chooses a lower resolution.
As for the best interpolation method, this depends on your objective and the type of data you have. The question is not what is the best method for your map, but what is the best method for your data. This is a very broad issue, out of scope for this space. But here's a good guide: http://www.rspatial.org/analysis/rst/4-interpolation.html
For general ideas on how to make good maps in R using ggplot: http://spatial.ly/r/
Sorry, I can't run your example at the moment to provide details. But try autoKrige() from the automap package.
Kriging is a great method for interpolation. Just be sure that your data meets its requirements. Here's a good guide:
https://gisgeography.com/kriging-interpolation-prediction/
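For what it's worth, here is a rough sketch of how autoKrige() could be applied to the simulated data from the question (untested, so treat it as an outline; the object names and the 200 x 200 prediction grid are my own choices):
library(sp)
library(automap)
# promote the data frame to a SpatialPointsDataFrame
pts <- as.data.frame(df)
coordinates(pts) <- ~ x + y
# build a regular prediction grid over the same extent
grid <- expand.grid(x = seq(min(df$x), max(df$x), length = 200),
                    y = seq(min(df$y), max(df$y), length = 200))
coordinates(grid) <- ~ x + y
gridded(grid) <- TRUE
# ordinary kriging of z with an automatically fitted variogram
kr <- autoKrige(z ~ 1, pts, grid)
df_krige <- as.data.frame(kr$krige_output)  # columns x, y, var1.pred, ...
q + geom_tile(aes(x = x, y = y, fill = var1.pred), data = df_krige, alpha = .4) +
  scale_fill_distiller(palette = "Spectral", na.value = NA)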
Here is a sample script using random numbers instead of real elevation data.
library(gridExtra)
library(spatstat) #im function
elevation <- runif(500, 0, 10)
B <- matrix(elevation, nrow = 20, ncol = 25)
Elevation_Map <- im(B)
custom <- colorRampPalette(c("cyan","green", "yellow", "orange", "red"))
plot(Elevation_Map, col = custom(10), main = NULL)
This is the plot and legend that I get:
This is the legend that I am trying to recreate in R (this one made in Word):
I know this is possible and it's probably a simple solution, but I've tried using some examples I found online to no avail.
This plot (with real elevation data) is an art piece that will be hung in a gallery, with the elevation plot on 1 board and the legend on a separate board. I tried to get R to plot just the plot without the legend using
plot(Elevation_Map, col = custom(10), main = NULL, legend = NULL)
as I have in the past, but for some reason it always plots the legend with the plot. As of right now I'm planning on just cropping the PDF into 2 separate files to achieve this.
Here are two ways of doing it using other packages:
# example data, set seed to reproduce.
set.seed(1); elevation <- runif(500, 0, 10)
B <- matrix(elevation, nrow = 20, ncol = 25)
#Elevation_Map <- im(B)
custom <- colorRampPalette(c("cyan","green", "yellow", "orange", "red"))
1) Using the fields package's image.plot(): it is the same "base" graphics::image.default() plot but with more arguments for customisation (though I couldn't remove the ticks from the legend):
library(fields)
image.plot(B, nlevel = 10, col = custom(10),
           breaks = seq(0, 10, length.out = 11),  # breakpoints spanning the full data range (one more than colours)
           lab.breaks = c("Low Elevation", rep("", 9), "High Elevation"),
           legend.mar = 10)
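On the tick issue: image.plot() also accepts an axis.args list that is passed on to the axis() call drawing the legend, so adding tick = FALSE there should keep the two labels but drop the tick marks (a suggestion I haven't verified against this exact output):
image.plot(B, nlevel = 10, col = custom(10),
           breaks = seq(0, 10, length.out = 11),
           lab.breaks = c("Low Elevation", rep("", 9), "High Elevation"),
           legend.mar = 10,
           axis.args = list(tick = FALSE))  # suppress tick marks on the legend axis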
2) Using the ggplot2 package's geom_raster() function:
library(ggplot2)
library(reshape2) # convert the matrix to a long data frame with melt()
B_melt <- reshape2::melt(B)
head(B_melt)
ggplot(B_melt, aes(Var1, Var2, fill = value)) +
  geom_raster() +
  theme_void() +
  scale_fill_gradientn(name = NULL,
                       breaks = c(1, 9),
                       labels = c("Low Elevation", "High Elevation"),
                       colours = custom(10))
The code in the original post is using the im class from the spatstat package. The plot command is dispatched to plot.im. Simply look at help(plot.im) to figure out how to control the colour ribbon. The relevant argument is ribargs. Here is a solution:
plot(Elevation_Map, col = custom(10), main = "",
     ribargs = list(at = range(Elevation_Map),
                    labels = c("Low Elevation", "High Elevation"),
                    las = 1))
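Since the final piece is meant to hang as two separate boards, it may also help that plot.im() has a logical ribbon argument, so the colour ribbon can be suppressed entirely for the map board (a small addition on my part, not from the answer above):
# map only, no colour ribbon
plot(Elevation_Map, col = custom(10), main = "", ribbon = FALSE)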
I am creating a number of heatmaps in R, but I am having problems when it comes to keeping the colour scale consistent across graphs.
I find that the colours are scaled within each graph. Is there a way to make the colours consistent across graphs, i.e. so that the colour difference between a value of 0.4 and 0.5 is always the same?
Code Example:
set.seed(123)
d1 = matrix(rnorm(9, mean = 0.2, sd = 0.1), ncol = 3)
d2 = matrix(rnorm(9, mean = 0.8, sd = 0.1), ncol = 3)
mat = list(d1, d2)
for (m in mat)
  heatmap(m, Rowv = NA, Colv = NA)
You'll note in the example that cell (2,3) in the first graph has a similar colour to cell (1,3) in the second, despite the underlying values differing by ~0.8.
Here's a way to do it with ggplot2, if you're open to not using base graphics:
library(reshape2)
library(ggplot2)
# Set common limits for color scale
limits = range(unlist(mat))
Here's the code for two separate graphs. The last line of code for each graph ensures that they use the same z limits for setting the colors:
ggplot(melt(mat[[1]]), aes(Var1, Var2, fill=value)) +
geom_tile() +
scale_fill_continuous(limits=limits)
ggplot(melt(mat[[2]]), aes(Var1, Var2, fill=value)) +
geom_tile() +
scale_fill_continuous(limits=limits)
Another option is to plot both heatmaps in a single graph using facetting, which automatically ensures both graphs are on the same color scale:
ggplot(melt(mat), aes(Var1, Var2, fill=value)) +
geom_tile() +
facet_grid(. ~ L1)
I've used the default colors here, but for either approach you can set the color scale to be anything you wish. For example:
ggplot(melt(mat), aes(Var1, Var2, fill=value)) +
geom_tile() +
facet_grid(. ~ L1) +
scale_fill_gradient(low="red", high="green")
You could use the image function directly (heatmap uses image), though it will require some extra formatting to match the output of heatmap. You can use zlim to set the color range. Quoting from the ?image page:
the minimum and maximum z values for which colors should be plotted,
defaulting to the range of the finite values of z. Each of the given
colors will be used to color an equispaced interval of this range. The
midpoints of the intervals cover the range, so that values just
outside the range will be plotted.
# define zlim min and max for all the plots
minz = Reduce(min, mat)
maxz = Reduce(max, mat)
for(m in mat) {
image( m, zlim = c(minz, maxz), col = heat.colors(20))
}
To get closer to the formatting produced by heatmap, you can just reuse some code from the heatmap function:
for(m in mat) {
labCol = dim(m)[2]
labRow = dim(m)[1]
image(seq_len(labCol), seq_len(labRow), m, zlim = c(minz, maxz),
col = heat.colors(20), axes = FALSE, xlab = "", ylab = "",
xlim = 0.5 + c(0, labCol), ylim = 0.5 + c(0, labRow))
axis(1, 1L:labCol, labels = seq_len(labCol), las = 2, line = -0.5, tick = 0)
axis(4, 1L:labRow, labels = seq_len(labRow), las = 2, line = -0.5, tick = 0)
}
Using the breaks argument to image is another option. It allows more flexibility than zlim in setting the breakpoints for colors. Quoting from the help page, breaks is
a set of finite numeric breakpoints for the colours: must have one
more breakpoint than colour and be in increasing order. Unsorted
vectors will be sorted, with a warning.
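As a brief sketch of that with the example data from above: build one shared breakpoint vector with one more breakpoint than colours and reuse it for every matrix, which pins each colour to the same value interval across plots.
# 21 shared breakpoints for 20 colours, spanning the combined range of both matrices
brks <- seq(min(unlist(mat)), max(unlist(mat)), length.out = 21)
for (m in mat) {
  image(m, breaks = brks, col = heat.colors(20))
}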
Using this example:
x <- mtcars
barplot(x$mpg)
you get a chart of many bars with a y-axis running from 0 to just over 30.
My question is: how can you adjust it so that the y-axis runs from 10 to 30, with a split at the bottom indicating that there is data below the cut-off?
Specifically, I want to do this in base R using only the barplot function and not functions from plotrix (unlike the suggestions already provided). Is this possible?
This is not recommended. It is generally considered bad practice to chop off the bottoms of bars. However, if you look at ?barplot, it has a ylim argument which can be combined with xpd = FALSE (which turns on "clipping") to chop off the bottom of the bars.
barplot(mtcars$mpg, ylim = c(10, 30), xpd = FALSE)
Also note that you should be careful here. I followed your question and used 10 and 30 as the y-bounds, but the maximum mpg is 33.9, so I also clipped the top of the 4 bars that have values > 30.
The only way I know of to make a "split" in an axis is using plotrix. So, based on
Specifically, I want to do this in base R using only the barplot function and not functions from plotrix (unlike the suggestions already provided). Is this possible?
the answer is "no, this is not possible" in the sense that I think you mean. plotrix certainly does it, and it uses base R functions, so you could do it however they do it, but then you might as well use plotrix.
You can plot on top of your barplot, perhaps a horizontal dashed line (like below) could help indicate that you're breaking the commonly accepted rules of what barplots should be:
abline(h = 10.2, col = "white", lwd = 2, lty = 2)
The resulting image is below:
Edit: You could use segments to spoof an axis break, something like this:
barplot(mtcars$mpg, ylim = c(10, 30), xpd = FALSE)
xbase = -1.5
xoff = 0.5
ybase = c(10.3, 10.7)
yoff = 0
segments(x0 = xbase - xoff, x1 = xbase + xoff,
y0 = ybase-yoff, y1 = ybase + yoff, xpd = T, lwd = 2)
abline(h = mean(ybase), lwd = 2, lty = 2, col = "white")
As-is, this is pretty fragile: xbase was adjusted by hand, and it will depend on the range of your data. You could switch the barplot to xaxs = "i" and set xbase = 0 for more predictability, but why not just use plotrix, which has already done all this work for you?!
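For reference, the plotrix route is roughly a one-liner with gap.barplot(); the gap range and tick positions below are arbitrary choices for mtcars$mpg, so treat it as a sketch:
library(plotrix)
# break the bars over the (empty) 2-9 range and keep the axis ticks outside it
gap.barplot(mtcars$mpg, gap = c(2, 9), ytics = c(0, 10, 20, 30))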
ggplot: In comments you said you don't like the look of ggplot. This is easily customized, e.g.:
library(ggplot2)
x$id <- seq_len(nrow(x))  # the plot needs an x-axis variable; create an index column
ggplot(x, aes(y = mpg, x = id)) +
  geom_bar(stat = "identity", color = "black", fill = "gray80", width = 0.8) +
  theme_classic()