Error with ggplot2 mapping variable to y and using stat="bin" - r

I am using ggplot2 to make a histogram:
geom_histogram(aes(x=...), y="..ncount../sum(..ncount..)")
and I get the error:
Mapping a variable to y and also using stat="bin".
With stat="bin", it will attempt to set the y value to the count of cases in each group.
This can result in unexpected behavior and will not be allowed in a future version of ggplot2.
If you want y to represent counts of cases, use stat="bin" and don't map a variable to y.
If you want y to represent values in the data, use stat="identity".
See ?geom_bar for examples. (Deprecated; last used in version 0.9.2)
What causes this in general? I am confused about the error because I'm not mapping a variable to y, just histogram-ing x and would like the height of the histogram bar to represent a normalized fraction of the data (such that all the bar heights together sum to 100% of the data.)
Edit: if I want to make a density plot with geom_density instead of geom_histogram, do I use ..ncount../sum(..ncount..) or ..scaled..? I'm unclear about what ..scaled.. does.

The confusion here is a long-standing one (as evidenced by the verbose warning message) that all starts with stat_bin.
But users don't typically realize that their confusion revolves around stat_bin, since they typically encounter problems while using either geom_bar or geom_histogram. Note the documentation for each: they both use stat = "bin" (in current ggplot2 versions this stat has been split into stat_bin for continuous data and stat_count for discrete data) by default.
But let's back up. geom_*'s control the actual rendering of data into some sort of geometric form. stat_*'s simply transform your data. The distinction is a bit confusing in practice, because adding a layer of stat_bin will, by default, invoke geom_bar and so it can seem indistinguishable from geom_bar when you're learning.
In any case, consider the "bar"-like geom's: histograms and bar charts. Both are clearly going to involve some binning of data somewhere along the line. But our data could either be pre-summarised or not. For instance, we might want a bar plot from:
x
a
a
a
b
b
b
or equivalently from
x y
a 3
b 3
The first hasn't been binned yet. The second is pre-binned. The default behavior for both geom_bar and geom_histogram is to assume that you have not pre-binned your data. So they will attempt to call stat_bin (for histograms, now stat_count for bar charts) on your x values.
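To make the distinction concrete, here is a minimal sketch using the two small tables above. The data frame names raw and pre are made up for illustration, and layer_data() is only used to peek at the bar heights ggplot2 ends up drawing.

```r
library(ggplot2)

# Un-binned data: one row per observation
raw <- data.frame(x = c("a", "a", "a", "b", "b", "b"))

# Pre-binned data: one row per category, count already computed
pre <- data.frame(x = c("a", "b"), y = c(3, 3))

# geom_bar() counts the rows for each x value itself
p_raw <- ggplot(raw, aes(x = x)) + geom_bar()

# stat = "identity" tells geom_bar() to use y exactly as given
p_pre <- ggplot(pre, aes(x = x, y = y)) + geom_bar(stat = "identity")

# Both layers end up with the same bar heights
heights_raw <- layer_data(p_raw)$y
heights_pre <- layer_data(p_pre)$y
```

Printing p_raw and p_pre should give the same picture, apart from the y-axis label.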
As the warning says, it will then try to map y for you to the resulting counts. If you also attempt to map y yourself to some other variable you end up in Here There Be Dragons territory. Mapping y to functions of the variables returned by stat_bin (..count.., etc.) should be ok and should not throw that warning (it doesn't for me using @mnel's example).
The take-away here is that for geom_bar if you've pre-computed the heights of the bars, always remember to use stat = "identity", or better yet use the newer geom_col which uses stat = "identity" by default. For geom_histogram it's very unlikely that you will have pre-computed the bins, so in most cases you just need to remember not to map y to anything beyond what's returned from stat_bin.
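For the histogram in the question (bar heights that sum to 100% of the data), a sketch along these lines should work. It assumes ggplot2 3.3.0 or later, where after_stat() is the current spelling of the ..count.. notation, and it assumes a single group with no facets. The data frame d and the choice of 20 bins are made up for illustration.

```r
library(ggplot2)

set.seed(1)
d <- data.frame(x = rnorm(200))  # made-up example data

# y is a function of the count computed by stat_bin, not a column of d
p <- ggplot(d, aes(x = x)) +
  geom_histogram(aes(y = after_stat(count / sum(count))), bins = 20) +
  ylab("fraction of observations")

# The bar heights add up to 1
total <- sum(layer_data(p)$y)
```

Because y only refers to values returned by the stat, this does not count as mapping a data variable to y.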
geom_dotplot uses its own binning stat, stat_bindot, and this discussion applies there as well, I believe. This sort of thing generally hasn't been an issue with the 2d binning cases (geom_bin2d and geom_hex), since there hasn't been as much flexibility available in the z variable, which is the analogue of the binned y variable in the 1d case. If future updates start allowing fancier manipulations of the 2d binning cases, this could, I suppose, become something you have to watch out for there.

The documentation for geom_histogram states that it is an alias for stat_bin and geom_bar.
The documentation for geom_density states that it uses a smooth density estimate produced using stat_density.
Following the links (or finding the help pages directly):
stat_bin
The documentation for stat_bin describes how stat_bin returns a data.frame with the following (additional) columns
count: number of points in bin
density: density of points in bin, scaled to integrate to 1
ncount: count, scaled to maximum of 1
ndensity: density, scaled to maximum of 1
stat_density
The documentation for stat_density describes how stat_density returns a data.frame with the following (additional) columns
density: density estimate
count: density * number of points - useful for stacked density plots
scaled: density estimate, scaled to maximum of 1
To produce plots on the same scale, it would appear that you want ..ndensity.. from stat_bin and ..scaled.. from stat_density, or ..density.. from both:
# both layers on the density scale
ggplot(dd, aes(x=x)) +
  geom_histogram(aes(y= ..density..)) +
  geom_density(aes(y=..density..))
# both layers rescaled to a maximum of 1
ggplot(dd, aes(x=x)) +
  geom_histogram(aes(y= ..ndensity..)) +
  geom_density(aes(y=..scaled..))
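In ggplot2 3.3.0 and later the same two plots can be written with after_stat() instead of the double-dot notation. The sketch below assumes that version. The contents of dd and the choice of 30 bins are made up, since the original dd is not shown.

```r
library(ggplot2)

set.seed(1)
dd <- data.frame(x = rnorm(500))  # stand-in for the dd used above

# Both layers on the density scale
p_density <- ggplot(dd, aes(x = x)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30) +
  geom_density(aes(y = after_stat(density)))

# Both layers rescaled so that their maximum is 1
p_scaled <- ggplot(dd, aes(x = x)) +
  geom_histogram(aes(y = after_stat(ndensity)), bins = 30) +
  geom_density(aes(y = after_stat(scaled)))

hist_max <- max(layer_data(p_scaled, 1)$y)  # tallest bar
dens_max <- max(layer_data(p_scaled, 2)$y)  # peak of the curve
```

In p_scaled both the tallest bar and the peak of the curve sit at 1, which is what puts the two layers on a comparable scale.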

Related

specify order of variables in position dodge

I honestly don't know why this is being so hard.
I'm creating a simple scatter plot. The x axis is a continuous variable, and at every tick in x I need to plot four points with error bars. I'm using position dodge and everything works fine.
Each point has a different color, size and shape as governed by three further variables: color and shape are governed by factors, size by a continuous variable.
By default, the four points reflect the order of the levels in the color variable (red always left, then green, then blue) but I would like them to reflect the order of the size variable (the continuous one), smallest left and largest right. How do I specify that size should be prioritised when ordering points in position dodge? I tried using reverse ordering but then the points are ordered first according to the shape legend.
I could change the mapping between variable and aesthetics (all variables are fundamentally continuous and could be used with size) but I think it'd be useful to know how to specify the order in which multiple variables should be considered when dodging points.
The question is somewhat unclear unfortunately. You don't show "a simple scatter plot". You are showing some statistics (mean with error band??) for specific x values - although this is seemingly continuous, this looks as if you have categorised it beforehand - resulting in some summary statistics which you are plotting.
Also, it is not easy (impossible) to fully help you without knowing what you have done until now to come to where you are.
I have tried to reproduce a similar looking plot with mtcars.
Dodging is only possible by one group (but one group can contain more than one variable). To specify how to group, add group = ... to your aesthetics.
Like so:
library(tidyverse)
ggplot(filter(mtcars, carb %in% 1:4)) +
  geom_point(aes(carb, mpg, size= gear, group = gear, shape = as.character(vs), color = as.factor(cyl)),
             position = position_dodge(width = .5))
This is now dodged by gear, which is also used as size aesthetic.
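If the group should be made up of more than one variable, one way (a sketch, not the only option) is to build the group with interaction(). The pairing of gear and vs below is an arbitrary choice for illustration.

```r
library(ggplot2)

d <- subset(mtcars, carb %in% 1:4)

# interaction() combines two variables into a single grouping factor,
# so points are dodged by every observed gear/vs combination
p <- ggplot(d, aes(carb, mpg)) +
  geom_point(
    aes(size = gear, shape = factor(vs), colour = factor(cyl),
        group = interaction(gear, vs)),
    position = position_dodge(width = 0.5)
  )

# Number of dodge groups ggplot2 actually created
n_groups <- length(unique(layer_data(p)$group))
```

Here n_groups should equal the number of distinct gear/vs pairs present in d.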

R: Weighted Joyplot/Ridgeplot/Density Plot?

I am trying to create a joyplot using the ggridges package (based on ggplot2). The general idea is that a joyplot creates nicely scaled stacked density plots. However, I cannot seem to produce one of these using weighted density. Is there some way of incorporating sampling weights (for weighted density) in the calculation of the densities in the creation of a joyplot?
Here's a link to the documentation for the ggridges package: https://cran.r-project.org/web/packages/ggridges/ggridges.pdf I know a lot of packages based on ggplot can accept additional aesthetics, but I don't know how to add weights to this type of geom object.
Additionally, here is an example of an unweighted joyplot in ggplot. I am trying to convert this to a weighted plot with the density weighted according to pweight.
# Load packages, set seed
library(ggplot2)
library(ggridges)
set.seed(1)
# Create an example dataset
dat <- data.frame(group = c(rep("A",100), rep("B",100)),
                  pweight = runif(200),
                  val = runif(200))
# Create an example of an unweighted joyplot
ggplot(dat, aes(x = val, y = group)) + geom_density_ridges(scale= 0.95)
It looks like the way to do this is to use stat_density rather than the default stat_density_ridges. Per the docs you linked to:
Note that the default stat_density_ridges makes joint density
estimation across all datasets. This may not generate the desired
result when using faceted plots. As an alternative, you can set
stat = "density" to use stat_density. In this case, it is required
to add the aesthetic mapping height = ..density.. (see examples).
Fortunately, stat_density (unlike stat_density_ridges) understands the aesthetic weight and will pass it to the underlying density call. You end up with something like:
ggplot(dat, aes(x = val, y = group)) +
  geom_density_ridges(aes(height=..density..,  # Notice the additional
                          weight=pweight),     # aes mappings
                      scale= 0.95,
                      stat="density")          # and use of stat_density
The ..density.. variable is automatically generated by stat_density.
Note: It appears that when you use stat_density the x-axis range behaves a little differently: it will trim the density plot to the data range and drop the nice-looking tails. You can easily correct this by manually expanding your x-axis, but I thought it was worth mentioning.
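To see the effect of the weight aesthetic and of widening the x-axis without involving ggridges, here is a sketch using plain geom_density, which also uses stat_density. Whether the ridgeline layer responds to expand_limits() in exactly the same way is an assumption on my part, and the limits -0.5 and 1.5 are arbitrary.

```r
library(ggplot2)

set.seed(1)
dat <- data.frame(group = rep(c("A", "B"), each = 100),
                  pweight = runif(200),
                  val = runif(200))

# Unweighted and weighted densities, one curve per group
p_unweighted <- ggplot(dat, aes(x = val, colour = group)) +
  geom_density()
p_weighted <- ggplot(dat, aes(x = val, colour = group, weight = pweight)) +
  geom_density()

# Widening the x range lets the curves extend past the data
p_wide <- p_weighted + expand_limits(x = c(-0.5, 1.5))

y_unweighted <- layer_data(p_unweighted)$y
y_weighted <- layer_data(p_weighted)$y
x_wide <- layer_data(p_wide)$x
```

The weighted and unweighted curves are evaluated over the same x range, so any difference between y_unweighted and y_weighted comes from the weights alone.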

How to interpret the different ggplot2 densities?

I am confused about the meaning of the following variants of geom_density in ggplot:
Can someone please explain the difference between these four calls:
geom_density(aes_string(x=myvar))
geom_density(aes_string(x=myvar, y="..density.."))
geom_density(aes_string(x=myvar, y="..scaled.."))
geom_density(aes_string(x=myvar, y="..count../sum(..count..)"))
My understanding is that:
geom_density alone will produce a density whose area under the curve sums to 1
geom_density with ..density.. basically does the same... ?
the ..count../sum(..count..) will normalize the peak heights to be more like a normalized histogram, ensuring that all the heights sum to 1
the ..count.. by itself without the denominator will just multiply each bin by # of items in it
the ..scaled.. parameter will make it so the maximum value of the density is 1.
I find ..scaled.. very counterintuitive and have never seen it used, if my interpretation of it is correct, so I'd like to ignore that. I am mainly looking for a clarification of the differences between geom_density and a kind of normalized density plot, which I am assuming requires the ..count../... argument. Thanks.
(Related: Error with ggplot2 mapping variable to y and using stat="bin")
The default aesthetic for stat_density is ..density.., so a call to geom_density, which uses stat_density by default, will plot y = ..density.. by default.
You can see how the various columns are calculated by looking at the source code.
..scaled.. is defined as
densdf$scaled <- densdf$y / max(densdf$y, na.rm = TRUE)
Feel free to ignore it if you wish.
Looking at the source code for stat_bin
The results are computed as such
res <- within(results, {
  count[is.na(count)] <- 0
  density <- count / width / sum(abs(count), na.rm=TRUE)
  ncount <- count / max(abs(count), na.rm=TRUE)
  ndensity <- density / max(abs(density), na.rm=TRUE)
})
So if you want to compare geom_density with geom_histogram (using the default stat = 'bin'), you can set y = ..density.. in the histogram and it will calculate count / sum(count) / width for you, that is, the fraction of observations in each bin divided by the bin width.
If you wanted to compare geom_density(aes(y=..scaled..)) with stat_bin, then you would use geom_histogram(aes(y = ..ndensity..))
You could get them on the same scale by using ..count.. in both as well, however you would need to adjust the adjust parameter in stat_density to get the appropriately detailed approximation of the curve.
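The formulas quoted from the stat_bin source can be checked against what ggplot2 returns. This is a small sketch with made-up data: layer_data() exposes the computed columns, and the bin width is recovered from xmin and xmax.

```r
library(ggplot2)

set.seed(1)
d <- data.frame(x = rexp(300))  # made-up example data

h <- layer_data(ggplot(d, aes(x = x)) + geom_histogram(bins = 25))
width <- h$xmax - h$xmin

# Recompute the stat_bin columns by hand from count and bin width
density_by_hand <- h$count / width / sum(h$count)
ncount_by_hand <- h$count / max(h$count)
ndensity_by_hand <- density_by_hand / max(density_by_hand)

# Total area of the bars on the density scale
area <- sum(h$density * width)
```

The hand-computed columns should match h$density, h$ncount and h$ndensity, and area should come out as 1.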

ggplot geom_bar vs geom_histogram

What is the difference (if any) between geom_bar and geom_histogram in ggplot? They seem to produce the same plot and take the same parameters.
Bar charts provide a visual presentation of categorical data. Examples:
The number of people with red, black and brown hair
Look at the geom_bar help file. The examples are all counts.
Wikipedia page
Histograms are used to plot density of interval (usually numeric) data. Examples,
Distributions of age and height
geom_histogram help file. The examples are distributions of movie ratings.
ggplot2
After a bit more investigating, I think in ggplot2 there is no difference between geom_bar and geom_histogram. From the docs:
geom_histogram(mapping = NULL, data = NULL, stat = "bin",
position = "stack", ...)
geom_bar(mapping = NULL, data = NULL, stat = "bin",
position = "stack", ...)
I realise that in the geom_histogram docs it states:
geom_histogram is an alias for geom_bar plus stat_bin
but to be honest, I'm not really sure what this means, since my understanding of ggplot2 is that both stat_bin and geom_bar are layers (with a slightly different emphasis).
The default behavior is the same for both geom_bar and geom_histogram. This is because (as @csgillespie mentioned) there is an implied stat_bin when you call geom_histogram (understandable), and it is also the default statistical transformation applied by geom_bar (arguable behavior IMO). That's why you need to specify stat='identity' when you want to plot the data as is.
The stat='bin' or stat_bin() is a statistical transformation that ggplot does for you. It provides as output the variables surrounded by two dots (..count.. and ..density..). If you don't specify stat='bin' you won't get those variables.
geom_bar() is for categorical x-values: it draws one bar per level, with the number of rows in that level as the height. There are gaps between the bars because the x-values are a factor with distinct levels.
geom_histogram() is for a single continuous variable. We usually put the continuous variable on the x-axis, where it is cut into bins, and the count per bin goes on the y-axis. The bars touch each other because the bins are contiguous.
For the case of one categorical and one continuous variable there is another plot we can use: geom_boxplot(). Usually the y-axis represents the continuous variable, which gives a vertical box-and-whisker plot.
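A short sketch of the difference in current ggplot2, where geom_bar defaults to stat_count and geom_histogram to stat_bin. The mtcars columns and the bin width of 5 are arbitrary choices for illustration.

```r
library(ggplot2)

# Categorical x: geom_bar() counts rows per level (stat_count)
p_bar <- ggplot(mtcars, aes(x = factor(cyl))) + geom_bar()

# Continuous x: geom_histogram() counts rows per interval (stat_bin)
p_hist <- ggplot(mtcars, aes(x = mpg)) + geom_histogram(binwidth = 5)

bar_counts <- layer_data(p_bar)$count        # one count per cyl level
hist_total <- sum(layer_data(p_hist)$count)  # every row lands in some bin
```

The bar counts match table(mtcars$cyl), while the histogram spreads the same 32 rows over intervals of mpg.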

Position-dodge warning with ggplot boxplot?

I'm trying to make a boxplot with ggplot2 using the following code:
p <- ggplot(
data,
aes(d$score, reorder(d$names d$scores, median))
) +
geom_boxplot()
I have factors called names and integers called scores.
My code produces a plot, but the graphic does not depict the boxes (only shows lines) and I get a warning message, "position_dodge requires non-overlapping x intervals." I've tried to adjust the height and width with geom_boxplot(width=5), but this does not seem to fix the problem. Can anyone suggest a possible solution to my problem?
I should point out that my boxplot is rather large and has about 200 name values on the y-axis. Perhaps this is the problem?
The number of groups is not the problem; I can see the same thing even when there are only 2 groups. The issue is that ggplot2 draws boxplots vertically (continuous along y, categorical along x) and you are trying to draw them horizontally (continuous along x, categorical along y).
Also, your example has several syntax errors and isn't reproducible because we don't have data/d.
Start with some mock data
dat <- data.frame(scores=rnorm(1000,sd=500),
                  names=sample(LETTERS, 1000, replace=TRUE))
Corrected version of your example code:
ggplot(dat, aes(scores, reorder(names, scores, median))) + geom_boxplot()
These are the horizontal lines you saw.
If you instead put the categorical on the x axis and the continuous on the y you get
ggplot(dat, aes(reorder(names, scores, median), scores)) + geom_boxplot()
Finally, if you want to flip the coordinate axes, you can use coord_flip(). There can be some additional problems with this if you are doing even more sophisticated things, but for basic boxplots it works.
ggplot(dat, aes(reorder(names, scores, median), scores)) +
  geom_boxplot() + coord_flip()
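In ggplot2 3.3.0 and later, geom_boxplot infers its orientation from the mapping, so the continuous variable can go on x directly and coord_flip() is not required. The sketch below assumes that version and uses the same mock data. set.seed(1) is added only to make it repeatable.

```r
library(ggplot2)

set.seed(1)
dat <- data.frame(scores = rnorm(1000, sd = 500),
                  names = sample(LETTERS, 1000, replace = TRUE))

# Continuous variable on x, categorical on y: the orientation is inferred,
# so horizontal boxes are drawn without coord_flip()
p <- ggplot(dat, aes(scores, reorder(names, scores, median))) +
  geom_boxplot()

boxes <- layer_data(p)  # one row per box
```

The flipped_aes column in boxes records that ggplot2 treated the layer as horizontal.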
In case anyone else arrives here wondering why they're seeing
Warning message:
position_dodge requires non-overlapping x intervals
Why this happens
This happens because some of the boxplots / violin plots (or other geoms) may be overlapping. In many cases you may not care, but in some cases it matters, which is why ggplot2 warns you.
How to fix it
You have two options. One is to suppress warnings when generating/printing the ggplot, for example by wrapping the print call in suppressWarnings().
The other is to alter the width of the geoms so that they don't overlap, after which the warning goes away. Try the width argument to the geom, e.g. geom_boxplot(width = 0.5) (the same works for geom_violin()).
In addition to @stevec's options, if you're seeing
position_stack requires non-overlapping x intervals
position_fill requires non-overlapping x intervals
position_dodge requires non-overlapping x intervals
position_dodge2 requires non-overlapping x intervals
and if your x variable is supposed to overlap for different aesthetics such as fill, you can try making x_var into a factor:
geom_bar(aes(x = factor(x_var), fill = type))
