ggplot: bin limits and ticks are misaligned - r

Look at the data and graph in this example:
df <- data.frame(x = round(rnorm(10000, mean=100, sd=15)))
df$x <- ifelse(df$x < 50, 50, df$x)
df$x <- ifelse(df$x > 150, 150, df$x)
library(ggplot2)
ggplot(df) +
aes(x = x) +
geom_histogram(aes(y = ..density..),
binwidth = 10,
fill="#69b3a2",
color="#e9ecef", alpha=0.9) +
stat_function(fun = dnorm, args = list(mean = mean(df$x),
sd = sd(df$x)))
The resulting graph is:
Note that the histogram goes outside the data bounds. The data is explicitly set to be limited to the 50 to 150 range, but the histogram seems to represent data from 45 to 155. In other words, the binning seems to be wrong.
Also note that the normal curve stops at the correct limits.
Is there a way to change the binning so that the bins go in the correct boundaries?
comment:
I have found workarounds such as this one:
ggplot axis ticks fall at center of bin value rather than at the bin limits
If I understand correctly, the idea there is to shift the data by half the bin width, but in this case that would be wrong at the other end of the range. It would also ruin the normal curve.

The default binning isn’t “wrong” per se, but you can control bin alignment by setting the boundary for one of your bins, using the boundary arg for geom_histogram():
library(ggplot2)
ggplot(df) +
aes(x = x) +
geom_histogram(aes(y = ..density..),
binwidth = 10,
fill="#69b3a2",
color="#e9ecef",
boundary = 50,
alpha=0.9) +
stat_function(fun = dnorm, args = list(mean = mean(df$x),
sd = sd(df$x)))
You could alternatively set center = 55 for the same result.
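To see where the 45 and 155 come from, it can help to compute the bin edges by hand. When neither boundary nor center is given, ggplot2 centers bins on multiples of the bin width, which is equivalent to boundary = binwidth / 2. The base R sketch below mimics that alignment. Note that bin_edges() is a hypothetical helper written for this illustration, not a ggplot2 function, and it only approximates what stat_bin() does internally.

```r
# Hypothetical helper (not part of ggplot2): edges of bins of a given width,
# aligned so that `boundary` falls on an edge and the data range is covered.
bin_edges <- function(x, width, boundary) {
  lo <- boundary + floor((min(x) - boundary) / width) * width
  hi <- boundary + ceiling((max(x) - boundary) / width) * width
  seq(lo, hi, by = width)
}

x <- c(50, 72, 100, 131, 150)  # toy data clamped to [50, 150]

# Default-style alignment: bins centered on multiples of 10
default_edges <- bin_edges(x, width = 10, boundary = 5)
# Alignment from the answer: one edge pinned at 50
aligned_edges <- bin_edges(x, width = 10, boundary = 50)

range(default_edges)  # 45 155
range(aligned_edges)  # 50 150
```

With boundary = 50 the outermost edges coincide with the limits of the data, which is why the histogram no longer spills past 50 and 150.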

Related

facet_zoom can't change breaks of zoomed plot

I currently have a plot and have used facet_zoom to focus on records between 0 and 10 in the x axis. The following code reproduces an example:
require(ggplot2)
require(ggforce)
require(dplyr)
x <- rnorm(10000, 50, 25)
y <- rexp(10000)
data <- data.frame(x, y)
ggplot(data, aes(x = x, y = y)) +
geom_point() +
facet_zoom(x = dplyr::between(x, 0, 10))
I want to change the breaks on the zoomed portion of the graph to be the equivalent of:
ggplot(data, aes(x = x, y = y)) +
geom_point() +
facet_zoom(x = dplyr::between(x, 0, 10)) +
scale_x_continuous(breaks = seq(0,10,2))
But this changes the breaks of the original plot as well. Is it possible to just change the breaks of the zoomed portion whilst leaving the original plot as default?
This works for your use case:
ggplot(data, aes(x = x, y = y)) +
geom_point() +
facet_zoom(x = between(x, 0, 10)) +
scale_x_continuous(breaks = pretty)
From ?scale_x_continuous, breaks would accept the following (emphasis added):
One of:
NULL for no breaks
waiver() for the default breaks computed by the transformation object
A numeric vector of positions
A function that takes the limits as input and returns breaks as output
pretty() is one such function. It doesn't offer very fine control, but does allow you to have some leeway to specify breaks across different facets with very different scales.
For illustration, here are two examples with different desired number of breaks. See ?pretty for more details on the other arguments this function accepts.
p <- ggplot(data, aes(x = x, y = y)) +
geom_point() +
facet_zoom(x = between(x, 0, 10))
cowplot::plot_grid(
p + scale_x_continuous(breaks = function(x) pretty(x, n = 3)),
p + scale_x_continuous(breaks = function(x) pretty(x, n = 10)),
labels = c("n = 3", "n = 10"),
nrow = 1
)
Of course, you can also define your own function to convert plot limits into the desired breaks, e.g. something like p + scale_x_continuous(breaks = function(x) seq(min(x), max(x), length.out = 5)), but I generally find that such functions need more tweaking to get right, and pretty() is often good enough.
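As a rough sketch of such a custom function: assuming, as the pretty() example suggests, that the function is called with each panel's own limits, it can return different breaks depending on how wide those limits are. zoom_breaks() and its zoom_width argument are hypothetical names chosen for this example, and the threshold of 20 is an assumption that fits this data, where the zoomed panel spans roughly 0 to 10 and the full panel spans far more.

```r
# Hypothetical break function: fixed breaks for a narrow (zoomed) panel,
# pretty() breaks for anything wider.
zoom_breaks <- function(lims, zoom_width = 20) {
  if (diff(range(lims)) < zoom_width) {
    seq(0, 10, by = 2)   # the breaks wanted in the zoomed panel
  } else {
    pretty(lims)         # leave the full panel close to the default
  }
}

zoomed_panel <- zoom_breaks(c(-0.5, 10.5))  # limits like the zoomed panel
full_panel   <- zoom_breaks(c(-50, 150))    # limits like the full panel

zoomed_panel
full_panel
```

Something like p + scale_x_continuous(breaks = zoom_breaks) would then apply it to the plot above; the threshold may need adjusting for other data.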

ggplot2 grids overlap out of boundary

I create points uniformly in [0,1], and each point has an observed value. But ggplot shows some observations beyond 1, which is outside the boundary. How can this happen even though the coordinates are all within the 0 to 1 range? Do you have any idea how to avoid this?
x=runif(10^6)
y=runif(10^6)
z=rnorm(10^6)
new.data=data.frame(x,y,z)
library(ggplot2)
ggplot(data=new.data) + stat_summary_2d(fun = mean, aes(x=x, y=y, z=z))
It’s an issue related to the grid used for binning.
Let’s use a smaller example.
set.seed(42)
x=runif(10^3)
y=runif(10^3)
z=rnorm(10^3)
new.data=data.frame(x,y,z)
library(ggplot2)
(g <- ggplot(data=new.data) +
stat_summary_2d(fun = mean, aes(x=x, y=y, z=z)) +
geom_point(aes(x, y)))
Now let’s zoom in on the box in the upper-left corner:
g + coord_cartesian(xlim = c(0.02, 0.075), ylim = c(0.99, 1.035),
expand = FALSE)
As you can see, that box starts below y = 1 but extends above it, because observations are binned according to some binwidth.
The same phenomenon can occur with a histogram.
ggplot(data.frame(x = runif(1000, 0, 1)), aes(x)) +
geom_histogram()
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
In geom_histogram this can be avoided by setting the boundary argument to 0 and choosing a binwidth that divides the total range evenly.
ggplot(data.frame(x = runif(1000, 0, 1)), aes(x)) +
geom_histogram(boundary = 0, binwidth = 0.1)
So the solution in your case is to set binwidth to 1/n, where n is an integer:
ggplot(data=new.data) +
stat_summary_2d(fun = mean, aes(x=x, y=y, z=z), binwidth = 0.1) +
geom_point(aes(x, y))
Created on 2018-11-04 by the reprex package (v0.2.1.9000)
You have:
set.seed(1)
x=runif(10^6)
Here's what's going on behind the scenes:
bins <- 30L
range <- range(x)
origin <- 0L
binwidth <- diff(range)/bins
breaks <- seq(origin, range[2] + binwidth, binwidth)
bins <- cut(x, breaks, include.lowest = TRUE, right = TRUE, dig.lab = 7)
table(bins)
# ...
# (0.8999984,0.9333317] (0.9333317,0.9666649] (0.9666649,0.9999982]
# 33217 33039 33297
# (0.9999982,1.033331]
# 1
max(x)
# [1] 0.9999984
How come this can happen even though coordinates are within 0 and 1 range?
Binning starts at 0 (not at the minimum value).
Each bin has a width of binwidth.
There is a final bin that ends at the maximum value + binwidth, and it receives the maximum value.
Do you have any idea how to avoid this?
One way would be to define your own breaks:
ggplot(data=new.data) + stat_summary_2d(fun = mean, aes(x=x, y=y, z=z), breaks = seq(0, 1, .1))
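A quick base R check of the difference, following the same break computation as above: with 30 equal-width bins starting at 0, the last break lands past 1, while explicit breaks stop exactly at 1. This is a sketch of the arithmetic only, not ggplot2's actual internals, and the variable names are made up for the example.

```r
set.seed(1)
x <- runif(10^4)

# Default-style breaks: 30 bins of equal width, starting at 0
binwidth <- diff(range(x)) / 30
default_breaks <- seq(0, max(x) + binwidth, by = binwidth)

# Explicit breaks: ten bins that tile [0, 1] exactly
fixed_breaks <- seq(0, 1, by = 0.1)

# Does the last default break stick out past 1?
overshoots <- max(default_breaks) > 1

# Does every point still fall into one of the explicit bins?
all_binned <- !anyNA(cut(x, fixed_breaks, include.lowest = TRUE))

overshoots
all_binned
```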

ggplot density function not correct (normal percent from Stata)

I am trying to replicate Stata's hist command with the normal and percent options. The problem is that the density plot (or normal distribution) is completely off:
library(scales)
library(ggplot2)
a <- data.frame(rnorm(100,0,1))
colnames(a) <- c("test")
ggplot(a,aes(test)) +
geom_histogram(aes(y=(..count..)/sum(..count..))) +
scale_y_continuous(labels=scales::percent) +
stat_function(fun='dnorm')
The line should be much closer to the graph, but instead it is scaled by a factor of about 10.
I'm not familiar with that Stata command, but is this what you want? A density histogram, with bar heights scaled so that the total area integrates to 1, like the normal curve you showed. Your attempt doesn't work because it doesn't account for the bin width: each bar's area is its width times its height, so bars whose heights sum to 1 cover a total area equal to the bin width, not 1. You can do the scaling manually if you set the bin width, or you can just use the computed ..density.. variable.
library(ggplot2)
set.seed(12345)
a <- data.frame(test = rnorm(100, 0, 1))
ggplot(a, aes(x = test)) +
geom_histogram(aes(y = ..count.. / (sum(..count..) * 0.2)), binwidth = 0.2) +
scale_y_continuous(labels = scales::percent) +
stat_function(fun = "dnorm")
ggplot(a, aes(x = test)) +
geom_histogram(aes(y = ..density..), binwidth = 0.2) +
scale_y_continuous(labels = scales::percent) +
stat_function(fun = "dnorm")
Created on 2018-08-27 by the reprex package (v0.2.0).
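If the goal is to keep the y axis in percent, as Stata's percent option does, one alternative (an assumption about what is wanted, not something the answer above proposes) is to leave the bars as proportions and scale the normal curve by the bin width instead. The sketch below checks the underlying relationship in base R using hist(); the variable names are illustrative.

```r
set.seed(12345)
test <- rnorm(100, 0, 1)
binwidth <- 0.2

# Fixed-width breaks that cover the whole sample
breaks <- seq(floor(min(test)), ceiling(max(test)), by = binwidth)
h <- hist(test, breaks = breaks, plot = FALSE)

# Proportion of observations per bin (what the percent axis shows)
proportion <- h$counts / sum(h$counts)

# Dividing by the bin width turns proportions into densities
dens <- proportion / binwidth
all.equal(dens, h$density)

# So a normal curve drawn on the proportion scale must be multiplied
# by the bin width to line up with the bars
scaled_dnorm <- function(x) dnorm(x) * binwidth
```

In ggplot2 this would correspond to something like geom_histogram(aes(y = ..count.. / sum(..count..)), binwidth = 0.2) together with stat_function(fun = function(x) dnorm(x) * 0.2).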

Specifying the scale for the density in ggplot2's stat_density2d

I'm looking to create multiple density graphs, to make an "animated heat map."
Since each frame of the animation should be comparable, I'd like the density -> color mapping on each graph to be the same for all of them, even if the range of the data changes for each one.
Here's the code I'd use for each individual graph:
ggplot(data= this_df, aes(x=X, y=Y) ) +
geom_point(aes(color= as.factor(condition)), alpha= .25) +
coord_cartesian(ylim= c(0, 768), xlim= c(0,1024)) + scale_y_reverse() +
stat_density2d(mapping= aes(alpha = ..level..), geom="polygon", bins=3, size=1)
Imagine I use this same code, but 'this_df' changes on each frame. So in one graph, maybe density ranges from 0 to 4e-4. On another, density ranges from 0 to 4e-2.
By default, ggplot will calculate a distinct density -> color mapping for each of these. But this would mean the two graphs-- the two frames of the animation--aren't really comparable. If this were a histogram or density plot, I'd simply make a call to coord_cartesian and change the x and y lim. But for the density plot, I have no idea how to change the scale.
The closest I could find is this:
Overlay two ggplot2 stat_density2d plots with alpha channels
But I don't have the option of putting the two density plots on the same graph, since I want them to be distinct frames.
Any help would be hugely appreciated!
EDIT:
Here's a reproducible example:
set.seed(4)
g = list(NA,NA)
for (i in 1:2) {
sdev = runif(1)
X = rnorm(1000, mean = 512, sd= 300*sdev)
Y = rnorm(1000, mean = 384, sd= 200*sdev)
this_df = as.data.frame( cbind(X = X,Y = Y, condition = 1:2) )
g[[i]] = ggplot(data= this_df, aes(x=X, y=Y) ) +
geom_point(aes(color= as.factor(condition)), alpha= .25) +
coord_cartesian(ylim= c(0, 768), xlim= c(0,1024)) + scale_y_reverse() +
stat_density2d(mapping= aes(alpha = ..level.., color= as.factor(condition)), geom="contour", bins=4, size= 2)
}
print(g) # level has a different scale for each
I would like to leave an update for this question. As of July 2016, stat_density2d no longer accepts breaks. To reproduce the graphic, you need to move breaks=1e-6*seq(0,10,by=2) to scale_alpha_continuous().
library(ggplot2)
library(gridExtra)  # provides grid.arrange()
set.seed(4)
g = list(NA,NA)
for (i in 1:2) {
sdev = runif(1)
X = rnorm(1000, mean = 512, sd= 300*sdev)
Y = rnorm(1000, mean = 384, sd= 200*sdev)
this_df = as.data.frame( cbind(X = X,Y = Y, condition = 1:2) )
g[[i]] = ggplot(data= this_df, aes(x=X, y=Y) ) +
geom_point(aes(color= as.factor(condition)), alpha= .25) +
coord_cartesian(ylim= c(0, 768), xlim= c(0,1024)) +
scale_y_reverse() +
stat_density2d(mapping= aes(alpha = ..level.., color= as.factor(condition)),
geom="contour", bins=4, size= 2) +
scale_alpha_continuous(limits=c(0,1e-5), breaks=1e-6*seq(0,10,by=2))+
scale_color_discrete("Condition")
}
do.call(grid.arrange,c(g,ncol=2))
So to have both plots show contours at the same levels, use the breaks=... argument in stat_density2d(...). To give both plots the same mapping of alpha to level, use scale_alpha_continuous(limits=...).
Here is the full code to demonstrate:
library(ggplot2)
set.seed(4)
g = list(NA,NA)
for (i in 1:2) {
sdev = runif(1)
X = rnorm(1000, mean = 512, sd= 300*sdev)
Y = rnorm(1000, mean = 384, sd= 200*sdev)
this_df = as.data.frame( cbind(X = X,Y = Y, condition = 1:2) )
g[[i]] = ggplot(data= this_df, aes(x=X, y=Y) ) +
geom_point(aes(color= as.factor(condition)), alpha= .25) +
coord_cartesian(ylim= c(0, 768), xlim= c(0,1024)) + scale_y_reverse() +
stat_density2d(mapping= aes(alpha = ..level.., color= as.factor(condition)),
breaks=1e-6*seq(0,10,by=2),geom="contour", bins=4, size= 2)+
scale_alpha_continuous(limits=c(0,1e-5))+
scale_color_discrete("Condition")
}
library(gridExtra)
do.call(grid.arrange,c(g,ncol=2))
And the result...
Not sure how useful this is, but I found it easier to use either:
scale_fill_gradient(low = "purple", high = "yellow", limits = c(0, 1000))
which lets you override the limits of the plot easily, choose colors, etc. You can just add it at the end of your code, where it overrides the earlier settings it needs to, so it's easy to use.
Or a similar solution using:
library(viridis)#colors for heat map
scale_fill_viridis(option = 'inferno')+
scale_fill_viridis_c(limits = c(0, 1000))
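Whichever scale is used, the limits have to be chosen so that they cover every frame. One way to pick them, sketched here under the assumption that the MASS package is installed (stat_density2d relies on MASS::kde2d for its estimate), is to compute the density for each frame up front and take the overall maximum. The grid size n = 100 and the names frames, peak and shared_limits are choices made for this example, and the values will only approximate the levels ggplot2 computes, since ggplot2 may evaluate the density over a different grid.

```r
library(MASS)  # for kde2d()

set.seed(4)
# Two frames with different spreads, built like the example data above
frames <- lapply(1:2, function(i) {
  sdev <- runif(1)
  data.frame(X = rnorm(1000, mean = 512, sd = 300 * sdev),
             Y = rnorm(1000, mean = 384, sd = 200 * sdev))
})

# Peak of the estimated 2D density in each frame
peak <- vapply(frames,
               function(d) max(kde2d(d$X, d$Y, n = 100)$z),
               numeric(1))

# One set of limits that covers every frame
shared_limits <- c(0, max(peak))
shared_limits
```

shared_limits can then be passed to every frame, for example as scale_alpha_continuous(limits = shared_limits), so that the same density always maps to the same alpha.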

R: prevent break in line showing time series data using ggplot geom_line

Using ggplot2 I want to draw a line that changes colour after a certain date. I expected this to be simple, but I get a break in the line at the point where the colour changes. Initially I thought this was a problem with group (as per this question; this other question also looked relevant but wasn't quite what I needed). Having messed around with the group aesthetic for 30 minutes I can't fix it, so if anybody can point out the obvious mistake...
Code:
require(ggplot2)
set.seed(1111)
mydf <- data.frame(mydate = seq(as.Date('2013-01-01'), by = 'day', length.out = 10),
y = runif(10, 100, 200))
mydf$cond <- ifelse(mydf$mydate > '2013-01-05', "red", "blue")
ggplot(mydf, aes(x = mydate, y = y, colour = cond)) +
geom_line() +
scale_colour_identity() +
theme()
If you set group=1, then 1 will be used as the group value for all data points, and the line will join up.
ggplot(mydf, aes(x = mydate, y = y, colour = cond, group=1)) +
geom_line() +
scale_colour_identity() +
theme()
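The break appears because, without an explicit group, ggplot2 forms one group per combination of the discrete variables in the plot, here the two values of cond, and draws a separate line for each. A small base R sketch of that split (the name by_colour is made up for the example) shows that no group contains both 2013-01-05 and 2013-01-06, so no segment connects them:

```r
mydf <- data.frame(
  mydate = seq(as.Date("2013-01-01"), by = "day", length.out = 10)
)
mydf$cond <- ifelse(mydf$mydate > as.Date("2013-01-05"), "red", "blue")

# Dates that end up in each implicit group
by_colour <- split(mydf$mydate, mydf$cond)

# The blue line ends on 2013-01-05 and the red line starts on 2013-01-06
last_blue <- max(by_colour$blue)
first_red <- min(by_colour$red)
last_blue
first_red
```

Setting group = 1 puts all ten rows into a single group, so the segment between those two dates is drawn as well.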
