GGPLOT relative frequency histogram - r

I am trying to plot the relative frequency of 1D data from 3 clusters. What I want is a single histogram that uses color to distinguish between the 3 clusters, and I want the height of each bin to represent the relative frequency of that value range for a particular cluster.
The code is as follows:
library(mvtnorm)
library(gtools)
library(ggplot2)
K = 3 # number of clusters
p_p = c(0.25, 0.25, 0.5) # population weights
theta_p = c(2, 5, 15) # population gamma params - shape
phi_p = c(2, 2, 5) # population gamma params - rate (third positional argument of rgamma)
N_p = c(25, 25, 50) # sample size within each cluster
set.seed(1) # set seed so that the results are the same each time
y <- numeric()
## We will now sample data from all three clusters
y[1:N_p[1]] <- rgamma(N_p[1], theta_p[1], phi_p[1])
y[(N_p[1]+1): (N_p[1]+N_p[2])] <- rgamma(N_p[2], theta_p[2], phi_p[2])
y[(N_p[1]+N_p[2]+1): sum(N_p)] <- rgamma(N_p[3], theta_p[3], phi_p[3])
Data = data.frame(y = y, source = as.factor(c(rep(1,25), rep(2,25), rep(3,50))))
ggplot(Data, aes(x=y, color = source))+
geom_histogram(aes(y=..count../sum(..count..)),fill="white", position="dodge", binwidth = 0.5) +
theme(legend.position="top")+labs(title="Samples against Theoretical Dist",y="Frequency", x="Sample Value")
length(which(y[1:25]<=0.5))/length(y)
length(which(y[1:25]<=0.5))/length(y[0:25])
Now, what I want is for the first red histogram bar to have a height equal to length(which(y[1:25]<=0.5))/length(y[0:25]). I would understand if i was getting length(which(y[1:25]<=0.5))/length(y) instead, and I could work around that.
However, I'm getting a height of around 0.12, which doesn't match either of these values and has me thinking I am completely misunderstanding ..count.. and sum(..count..).

The issue isn't your understanding of ..count.. but your assumption about how binwidth works. You have assumed that setting it to 0.5 puts the breaks at 0, 0.5, 1, 1.5, etc., but in fact it starts them at the lowest value of the range of your data. So the height of your first bar is length(which(y[1:25] <= (min(y) + 0.5)))/length(y), which is 0.13 (13 of the 100 observations).
You can specify breaks in geom_histogram to work round this limitation:
ggplot(Data, aes(x = y, color = source)) +
geom_histogram(aes(y = stat(count)/length(y)), fill = "white",
position = "dodge", breaks = seq(0, 6, 0.5)) +
theme(legend.position = "top") +
labs(title = "Samples against Theoretical Dist",
y = "Frequency", x = "Sample Value")
Now each bar is 1/100th of the count since the vector is 100 long.
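If the goal is the relative frequency within each cluster (so that each cluster's bars sum to 1) rather than a share of all 100 observations, the numbers can be checked by hand outside ggplot2. The following is a minimal sketch in base R, assuming the same simulated data as in the question; the names cluster, bins, rel_overall and rel_within are illustrative and do not appear in the original code.

```r
# Recreate the question's data: three gamma samples drawn in the same order
set.seed(1)
y <- c(rgamma(25, 2, 2), rgamma(25, 5, 2), rgamma(50, 15, 5))
cluster <- factor(rep(1:3, times = c(25, 25, 50)))

# Explicit breaks at 0, 0.5, 1, ... so the first bin is [0, 0.5]
breaks <- seq(0, ceiling(max(y)), by = 0.5)
bins <- cut(y, breaks = breaks, include.lowest = TRUE)

# Counts per cluster and bin
counts <- table(cluster, bins)

# Share of all observations (what dividing by length(y) gives)
rel_overall <- counts / length(y)

# Share within each cluster (each row sums to 1)
rel_within <- prop.table(counts, margin = 1)

# First bin for each cluster under both definitions
rel_overall[, 1]
rel_within[, 1]
```

For cluster 1, the first entry of rel_within is the same quantity as length(which(y[1:25] <= 0.5)) / 25 from the question.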

Advice on how to plot side-by-side histograms with a line graph going through them in ggplot2

I'm currently finishing off my Masters project and need to include some graphics for the write-up. Without boring you too much, I have some data which is associated with AR(1) parameters ranging from 0.1 to 0.9 by 0.1 increments. As such I thought of doing a faceted histogram like the one below (worry not about the hideous fruit salad of colours, it will not be used).
I used this code.
ggplot(opt_lens_geom,aes(x=l_1024,fill=factor(rho))) + geom_histogram()+coord_flip()+facet_grid(.~rho,scales = "free_x")
I would also like to draw a trend line for the median values, since the AR(1) parameter is continuous. In a later iteration I deleted the padding and made it "look" like it was one graph, but I have had issues with the endpoints matching up since each facet is a separate graphical device. Can anyone give me some advice on how to do this? I am not particularly partial to the faceting, so if it is not needed I can do away with it.
I will try to upload sample data, but simulating 100 values for each of the 9 rhos would work just to get started, like:
opt_lens_geom <- data.frame(rho= rep(seq(0.1,0.9,by=0.1),each=100),l_1024=rnorm(900))
You might consider ggridges. I've assumed here that you want a median value for each value of rho.
library(ggplot2)
library(ggridges)
library(dplyr)
set.seed(1001)
opt_lens_geom <- data.frame(rho = rep(seq(0.1, 0.9, by = 0.1), each = 100),
l_1024 = rnorm(900))
opt_lens_geom %>%
mutate(rho_f = factor(rho)) %>%
ggplot(aes(l_1024, rho_f)) +
stat_density_ridges(quantiles = 2, quantile_lines = TRUE)
You can add scale = 1 as a parameter to stat_density_ridges if you don't like the amount of overlap.
Try the following. It uses a pre-computed data frame of the medians.
library(ggplot2)
df <- iris[c(1, 5)]
names(df) <- c("val", "rho")
# plyr is not attached, so summarise must be namespaced as well
med <- plyr::ddply(df, "rho", plyr::summarise, m = median(val))
ggplot(data = df, aes(x = val, fill = factor(rho))) +
geom_histogram() +
coord_flip() +
geom_vline(data = med, aes(xintercept = m), colour = 'black') +
facet_wrap(~ factor(rho))
You could do a variant on this using geom_violin instead of using histograms, although you wouldn't get labelled counts, just an idea of the relative density. Example with made up data:
df = data.frame(
rho = rep(c(0.1, 0.2, 0.3), each = 50),
val = sample(1:10, 150, replace = TRUE)
)
df$val = df$val + (5 * (df$rho == 0.2)) + (8 * (df$rho == 0.3))
ggplot(df, aes(x = rho, y = val, fill = factor(rho))) +
geom_violin() +
stat_summary(aes(group = 1), colour = "black",
geom = "line", fun = "median") # fun replaces the deprecated fun.y (ggplot2 >= 3.3.0)
This produces a violin for each value of rho, and joins the medians for each violin.
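For the trend line itself, the medians can also be pre-computed into their own data frame and handed to a line layer. A minimal sketch in base R, using simulated data of the shape suggested in the question (the name med, the column name median_l_1024 and the seed are illustrative):

```r
# Simulated stand-in for the real data: 100 values for each of 9 rho values
set.seed(1001)
opt_lens_geom <- data.frame(rho = rep(seq(0.1, 0.9, by = 0.1), each = 100),
                            l_1024 = rnorm(900))

# One median per value of rho
med <- aggregate(l_1024 ~ rho, data = opt_lens_geom, FUN = median)
names(med)[2] <- "median_l_1024"
med
```

med has one row per rho, so a layer such as geom_line(data = med, aes(x = rho, y = median_l_1024)) could be added to a plot whose x axis is the numeric rho, as in the violin example.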

`expand` argument in `scale_color_gradient` is ignored

I have a ggplot2 plot with a continuous color scale and I want to remove the extra bits above the maximum and below the minimum. For example:
set.seed(10)
n = 20
ggplot(data.frame(
x = rnorm(n),
y = rnorm(n),
col = c(0, 1, runif(n - 2)))) +
geom_point(aes(x, y, color = col))
See how the color scale extends a little above 1 and a little below 0? I want to get rid of that. But expand seems to be ignored. If I add scale_color_gradient(expand = c(0, 0)) to the above, there's no visible change. In fact, scale_color_gradient(expand = c(100, 100)) makes no difference, either. Is this a bug?
TLDR
It's not a bug. Increasing the nbin parameter in guide_colourbar (the function that shows continuous colour scales mapped onto values in ggplot2) will move the tick positions closer to the ends.
Explanation
guide_colourbar renders the range of colour values into a number of bins (nbin = 20 by default at the time of writing). The first and last ticks showing the range of values are positioned at the midpoints of the first and last bins respectively.
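As a rough numerical illustration of that geometry, the sketch below treats the colour bar as having length 1 and being split into nbin equal bins, so that a limit tick at the midpoint of an end bin sits 1 / (2 * nbin) away from the end of the bar. This is a simplified model of the description above, not code taken from ggplot2, and tick_offset is a made-up helper name.

```r
# Simplified model: colour bar of length 1, nbin equal bins,
# limit ticks at the midpoints of the first and last bins.
tick_offset <- function(nbin) 1 / (2 * nbin)

nbin <- c(2, 4, 10, 20, 50, 100, 500)
offsets <- data.frame(nbin = nbin, offset_fraction = tick_offset(nbin))
offsets
```

Under this model the overhang is a quarter of the bar with 2 bins, 2.5% of the bar with 20 bins, and 0.1% with 500 bins.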
Below are some illustrations for different nbin values. I also switched from the default raster = TRUE to raster = FALSE for clearer distinction between bins, as raster = TRUE comes with interpolation.
Setting up data / base plot
set.seed(10)
n = 20
df <- data.frame(x = rnorm(n),
y = rnorm(n),
col = c(0, 1, runif(n - 2)))
# base plot
p <- ggplot(df) +
geom_point(aes(x, y, color = col))
# extreme case: with only 2 bins, the two ticks corresponding to the
# lower & upper limits are positioned in the middle of each rectangle, with
# remaining ticks evenly spaced between them
p + ggtitle("nbin = 2") +
scale_colour_continuous(guide = guide_colourbar(nbin = 2, raster = FALSE))
# as we increase the number of bins, the upper / lower limit ticks move closer
# to the respective ends
p + ggtitle("nbin = 4") +
scale_colour_continuous(guide = guide_colourbar(nbin = 4, raster = FALSE))
p + ggtitle("nbin = 10") +
scale_colour_continuous(guide = guide_colourbar(nbin = 10, raster = FALSE))
# nbin = 20 is the default value; at this point, the upper / lower limit ticks
# are relatively close to the ends, but still distinct
p + ggtitle("nbin = 20") +
scale_colour_continuous(guide = guide_colourbar(nbin = 20, raster = FALSE))
# with 50 bins, the upper / lower limit ticks move closer to the ends, and
# the stacked rectangles are so thin that raster = FALSE doesn't really have
# much effect from here onwards; we can't visually distinguish the individual
# rectangles anymore
p + ggtitle("nbin = 50") +
scale_colour_continuous(guide = guide_colourbar(nbin = 50, raster = FALSE))
# with 100 bins, the upper / lower limit ticks are even closer
p + ggtitle("nbin = 100") +
scale_colour_continuous(guide = guide_colourbar(nbin = 100, raster = FALSE))
# with 500 bins, the upper / lower limits are practically at the ends now
p + ggtitle("nbin = 500") +
scale_colour_continuous(guide = guide_colourbar(nbin = 500, raster = FALSE))

density plot of a summed factor using stat_density_2d in ggplot

I have a dataset with 29 columns and 2500 rows resulting from a test. Three columns need to be represented on a plot: the first two are simple X,Y coordinate pairs representing actual X,Y positions on an image used in the test, and the third is a response from the participants giving a simple yes or no answer (recorded as 1 and -1 respectively).
Each X,Y coordinate was used multiple times in the test, and I'm trying to get an overall bias for each point. The values can be found by a simple sum of the Y,N answers. My problem is that I can't plot the "sum" of the answers, only the density of the yes and no answers separately. I need to show the overall bias towards yes or no for each point, so having two plots, or simply plotting the two sets of results on the same plot, is of little value.
In the code I'm using the X value is audioDim1a and the Y value is audioDim2. There are 2 DFs used which have been reduced - one to include all the Y answers and the other all the N answers.
This code uses the two N & Y data frames:
ggplot() +
xlim(0, 110) + ylim(0, 150) +
stat_density_2d(data = test_plot_N,
aes(audioDim1a, audioDim2, alpha = "density", fill = "density"),
geom = "polygon", size = 0.2, contour = T, n = 150, h = 20, bins = 10,
colour = "purple") +
stat_density_2d(data = test_plot_Y,
aes(audioDim1a, audioDim2, alpha = "density", fill = "density"),
geom = "polygon", size = 0.2, contour = T, n = 150, h = 20, bins = 10,
colour = "green") +
geom_point(data = test_plot_N, aes(audioDim1a, audioDim2), colour = "blue", size = 1)
If I use a dataset (see below) with the Y and N combined I hoped to get the situation where if the number of Y and N answers was equal the density would result in a 0 plot and thus the contour fill would be clear/white. This does not happen as it seems to simply show a count of responses rather than an arithmetic sum.
ggplot() +
xlim(0, 110) + ylim(0, 150) +
stat_density_2d(data = test_plot,
aes(audioDim1a, audioDim2, alpha = "density", fill = "density"),
geom = "polygon", size = 0.2, contour = T, n = 150, h = 20, bins = 10,
colour = "purple") +
geom_point(data = test_plot_N, aes(audioDim1a, audioDim2), colour = "blue", size = 1)
Do I need to supply the data set and the full R code I'm using?
Any help would be really appreciated.
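One way to get at the "sum" described above is to aggregate the responses per coordinate before plotting, rather than asking a density estimator to do it. The sketch below uses made-up data; the column name response, the grid of positions and the number of repeats are all assumptions, since the real data frame is not shown.

```r
# Hypothetical stand-in for the combined data frame: each (x, y) position is
# presented 10 times and answered with 1 (yes) or -1 (no).
set.seed(42)
positions <- expand.grid(audioDim1a = c(20, 60, 100),
                         audioDim2 = c(30, 90, 140))
test_plot <- positions[rep(seq_len(nrow(positions)), each = 10), ]
test_plot$response <- sample(c(1, -1), nrow(test_plot), replace = TRUE)

# Net bias per position: positive leans yes, negative leans no, 0 is balanced
bias <- aggregate(response ~ audioDim1a + audioDim2, data = test_plot, FUN = sum)
names(bias)[3] <- "net_bias"
bias
```

A data frame like bias could then be drawn with geom_point or geom_tile using fill = net_bias together with scale_fill_gradient2(midpoint = 0), so that positions with equal numbers of yes and no answers come out white.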

ggplot: rescale axis (log) and cut axis

I want to plot a very simple boxplot like this in R:
desired graph
It is a log-link (Gamma distributed: jh_conc is a hormone concentration variable) Generalized linear model of a continuous dependent variable (jh_conc) for a categorical grouping variable (group: type of bee)
My script that I already have is:
> jh=read.csv("data_jh_titer.csv",header=T)
> jh
group jh_conc
1 Queens 6.38542714
2 Queens 11.22512563
3 Queens 7.74472362
4 Queens 11.56834171
5 Queens 3.74020100
6 Virgin Queens 0.06080402
7 Virgin Queens 0.12663317
8 Virgin Queens 0.08090452
9 Virgin Queens 0.04422111
10 Virgin Queens 0.14673367
11 Workers 0.03417085
12 Workers 0.02449749
13 Workers 0.02927136
14 Workers 0.01648241
15 Workers 0.02150754
fit=glm(jh_conc~group,family=Gamma(link=log), data=jh)
ggplot(fit, aes(group, jh_conc))+
geom_boxplot(aes(fill=group))+
coord_trans(y="log")
the resulting plot looks like this:
My question is: what (geom) extensions can I use to split the y-axis and rescale the two parts differently? Also, how do I add the black circles (averages, which are calculated on a log scale and then back-transformed to the original scale) and the horizontal lines, which are significance levels based on post hoc tests performed on log-transformed data (**: p < 0.01, ***: p < 0.001)?
You can't create a broken numeric axis in ggplot2 by design, mainly because it visually distorts the data/differences being represented and is considered misleading.
You can, however, use scale_y_log10() + annotation_logticks() to help condense data across a wide range of values or to better show heteroskedastic data. You can also use annotate to build out your p-value stars and bars.
You can also easily grab information from a model using its named attributes; here we care about fit$coef:
# make a zero intercept version for easy plotting
fit2 <- glm(jh_conc ~ 0 + group, family = Gamma(link = log), data = jh)
# extract relevant group means and use exp() to scale back
means <- data.frame(group = gsub("group", "",names(fit2$coef)), means = exp(fit2$coef))
ggplot(jh, aes(group, jh_conc)) +
geom_boxplot(aes(fill=group)) +
# plot the circles from the model extraction (means)
geom_point(data = means, aes(y = means),size = 4, shape = 21, color = "black", fill = NA) +
# use this instead of coord_trans
scale_y_log10() + annotation_logticks(sides = "l") +
# use annotate "segment" to draw the horizontal lines
annotate("segment", x = 1, xend = 2, y = 15, yend = 15) +
# use annotate "text" to add your pvalue *'s
annotate("text", x = 1.5, y = 15.5, label = "**", size = 4) +
annotate("segment", x = 1, xend = 3, y = 20, yend = 20) +
annotate("text", x = 2, y = 20.5, label = "***", size = 4) +
annotate("segment", x = 2, xend = 3, y = .2, yend = .2) +
annotate("text", x = 2.5, y = .25, label = "**", size = 4)
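It may help to check what the back-transformed coefficients actually represent. In a GLM whose only predictor is the grouping factor, the fitted group means match the arithmetic means of the raw data, whereas averaging on the log scale and back-transforming gives geometric means. The sketch below rebuilds the data printed in the question and compares the three; the names glm_means, arith_means and geo_means are illustrative.

```r
# Data as printed in the question
jh <- data.frame(
  group = rep(c("Queens", "Virgin Queens", "Workers"), each = 5),
  jh_conc = c(6.38542714, 11.22512563, 7.74472362, 11.56834171, 3.74020100,
              0.06080402, 0.12663317, 0.08090452, 0.04422111, 0.14673367,
              0.03417085, 0.02449749, 0.02927136, 0.01648241, 0.02150754)
)

# Zero-intercept Gamma GLM with log link, as in the answer
fit2 <- glm(jh_conc ~ 0 + group, family = Gamma(link = log), data = jh)

glm_means   <- exp(coef(fit2))                              # back-transformed coefficients
arith_means <- tapply(jh$jh_conc, jh$group, mean)           # plain group means
geo_means   <- exp(tapply(log(jh$jh_conc), jh$group, mean)) # mean of logs, back-transformed

round(cbind(glm = unname(glm_means),
            arithmetic = arith_means,
            geometric = geo_means), 4)
```

If the circles in the desired graph are meant to be averages taken on the log scale, geo_means would be the values to plot instead of exp(fit2$coef).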

How to correctly interpret ggplot's stat_density2d

My initial goal was to plot a population of individual points and then draw a convex hull enclosing 80% of that population centered on the mass of the population.
After trying a number of ideas, the best solution I came up with was to use ggplot's stat_density2d. While this works great for a qualitative analysis, I still need to indicate an 80% boundary. I started out looking for a way to outline the 80th percentile population boundary, but I can work with an 80% probability density boundary instead.
Here's where I'm looking for help. The bin parameter for kde2d (used by stat_density2d) is not clearly documented. If I set bin = 4 in the example below, am I correct in interpreting the central (green) region as containing a 25% probability mass and the combined yellow, red, and green areas as representing a 75% probability mass? If so, by changing bin to 5, would the area inscribed then equal an 80% probability mass?
set.seed(1)
n=100
df <- data.frame(x=rnorm(n, 0, 1), y=rnorm(n, 0, 1))
TestData <- ggplot (data = df) +
stat_density2d(aes(x = x, y = y, fill = as.factor(..level..)),
bins=4, geom = "polygon") +
geom_point(aes(x = x, y = y)) +
scale_fill_manual(values = c("yellow","red","green","royalblue", "black"))
TestData
I repeated a number of test cases and manually counted the excluded points [would love to find a way to count them based on what ..level.. they were contained within] but given the random nature of the data (both my real data and the test data) the number of points outside of the stat_density2d area varied enough to warrant asking for help.
Summarizing, is there a practical means of drawing a polygon around the central 80% of the population of points in the data frame? Or, barring that, am I safe to use stat_density2d and set bin equal to 5 to produce an 80% probability mass?
Excellent answer from Bryan Hanson dispelling the fuzzy notion that I could pass an undocumented bin parameter in stat_density2d. The results looked close at values for bin around 4 to 6, but as he stated, the actual function is unknown and therefore not usable.
I used HPDregionplot as provided in the accepted answer by DWin to solve my problem. To that, I added a center of gravity (COGravity) and point in polygon (pnt.in.poly) from the SDMTools package to complete the analysis.
library(MASS)
library(coda)
library(SDMTools)
library(emdbook)
library(ggplot2)
theme_set(theme_bw(16))
set.seed(1)
n=100
df <- data.frame(x=rnorm(n, 0, 1), y=rnorm(n, 0, 1))
HPDregionplot(mcmc(data.matrix(df)), prob=0.8)
with(df, points(x,y))
ContourLines <- as.data.frame(HPDregionplot(mcmc(data.matrix(df)), prob=0.8))
df$inpoly <- pnt.in.poly(df, ContourLines[, c("x", "y")])$pip
dp <- df[df$inpoly == 1,]
COG100 <- as.data.frame(t(COGravity(df$x, df$y)))
COG80 <- as.data.frame(t(COGravity(dp$x, dp$y)))
TestData <- ggplot (data = df) +
stat_density2d(aes(x = x, y = y, fill = as.factor(..level..)),
bins=5, geom = "polygon") +
geom_point(aes(x = x, y = y, colour = as.factor(inpoly)), alpha = 1) +
geom_point(data=COG100, aes(COGx, COGy),colour="white",size=2, shape = 4) +
geom_point(data=COG80, aes(COGx, COGy),colour="green",size=4, shape = 3) +
geom_polygon(data = ContourLines, aes(x = x, y = y), color = "blue", fill = NA) +
scale_fill_manual(values = c("yellow","red","green","royalblue", "brown", "black", "white", "black", "white","black")) +
scale_colour_manual(values = c("red", "black"))
TestData
nrow(dp)/nrow(df) # actual proportion of population members inscribed within the 80% probability polygon
Alright, let me start by saying I'm not entirely sure of this answer, and it's only a partial answer! There is no bin parameter for MASS::kde2d, which is the function used by stat_density2d. Looking at the help page for kde2d and the code for it (seen simply by typing the function name in the console), I think the bin parameter is h (how these functions know to pass bin to h is not clear, however). Following the help page, we see that if h is not provided, it is computed by MASS::bandwidth.nrd. The help page for that function says this:
# The function is currently defined as
function(x)
{
r <- quantile(x, c(0.25, 0.75))
h <- (r[2] - r[1])/1.34
4 * 1.06 * min(sqrt(var(x)), h) * length(x)^(-1/5)
}
Based on this, I think the answer to your last question ("Am I safe...") is definitely no. r in the above function is what you need for your assumption to be safe, but it is clearly modified, so you are not safe. HTH.
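To make the role of h concrete, the following sketch applies the formula quoted above by hand and compares it with MASS::bandwidth.nrd on simulated data. It assumes only that the MASS package is installed; by_hand and from_mass are illustrative names.

```r
# Simulated data, as in the question
set.seed(1)
x <- rnorm(100)

# The quoted formula, step by step
r <- quantile(x, c(0.25, 0.75))
h <- (r[2] - r[1]) / 1.34
by_hand <- 4 * 1.06 * min(sqrt(var(x)), h) * length(x)^(-1/5)

# The packaged version
from_mass <- MASS::bandwidth.nrd(x)

c(by_hand = unname(by_hand), from_mass = unname(from_mass))
```

The result is a smoothing width that depends on the spread of the data and the sample size; nothing in it refers to a target probability such as 80%.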
Additional thought: Do you have any evidence that your code is using your bins argument? I'm wondering if it is being ignored. If so, try passing h in place of bins and see if it listens.
HPDregionplot in package:emdbook is supposed to do that. It does use MASS::kde2d but it normalizes the result. It has the disadvantage to my mind that it requires an mcmc object.
library(MASS)
library(coda)
library(emdbook) # provides HPDregionplot
HPDregionplot(mcmc(data.matrix(df)), prob=0.8)
with(df, points(x,y))
Building on the answer by 42, I've simplified HPDregionplot() to reduce dependencies and remove the requirement to work with mcmc objects. The function works on a two-column data.frame and creates no intermediate plots. Note, however, that this approach breaks as soon as grDevices::contourLines() returns multiple contours.
hpd_contour <- function (x, n = 50, prob = 0.95, ...) {
post1 <- MASS::kde2d(x[[1]], x[[2]], n = n, ...)
dx <- diff(post1$x[1:2])
dy <- diff(post1$y[1:2])
sz <- sort(post1$z)
c1 <- cumsum(sz) * dx * dy
levels <- sapply(prob, function(x) {
approx(c1, sz, xout = 1 - x)$y
})
as.data.frame(grDevices::contourLines(post1$x, post1$y, post1$z, levels = levels))
}
theme_set(theme_bw(16))
set.seed(1)
n=100
df <- data.frame(x=rnorm(n, 0, 1), y=rnorm(n, 0, 1))
ContourLines <- hpd_contour(df, prob=0.8)
ggplot(df, aes(x = x, y = y)) +
stat_density2d(aes(fill = as.factor(..level..)), bins=5, geom = "polygon") +
geom_point() +
geom_polygon(data = ContourLines, color = "blue", fill = NA) +
scale_fill_manual(values = c("yellow","red","green","royalblue", "brown", "black", "white", "black", "white","black")) +
scale_colour_manual(values = c("red", "black"))
Moreover, the workflow now easily extends to grouped data.
library(dplyr) # for %>%, group_by() and do()
ContourLines <- iris[, c("Species", "Sepal.Length", "Sepal.Width")] %>%
group_by(Species) %>%
do(hpd_contour(.[, c("Sepal.Length", "Sepal.Width")], prob=0.8))
ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
geom_point(size = 3, alpha = 0.6) +
geom_polygon(data = ContourLines, fill = NA) +
guides(color = "none") +
theme(plot.margin = margin())
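To make the level calculation inside hpd_contour() more concrete, here is a sketch that repeats its steps on the simulated data and then checks how much of the gridded density lies above the chosen level. It assumes only the MASS package; mass_total, mass_outside and mass_inside are illustrative names. Note that kde2d() evaluates the density on a finite grid, so the total gridded mass need not be exactly 1.

```r
# Simulated data, as in the question
set.seed(1)
df <- data.frame(x = rnorm(100), y = rnorm(100))

# Kernel density estimate on a 50 x 50 grid
dens <- MASS::kde2d(df$x, df$y, n = 50)
dx <- diff(dens$x[1:2])
dy <- diff(dens$y[1:2])

# Sort the grid densities and accumulate mass from the lowest cells upwards
sz <- sort(dens$z)
c1 <- cumsum(sz) * dx * dy

# Density level below which 20% of the mass lies (for an 80% region)
level <- approx(c1, sz, xout = 1 - 0.8)$y

# Mass on the grid overall, below the level, and at or above the level
mass_total   <- sum(dens$z) * dx * dy
mass_inside  <- sum(dens$z[dens$z >= level]) * dx * dy
mass_outside <- mass_total - mass_inside

c(level = level, total = mass_total, inside = mass_inside, outside = mass_outside)
```

The mass outside the contour comes out close to 0.2 by construction, so the mass inside is close to 0.8 only to the extent that the gridded total is close to 1.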
