I am confused about the meaning of the following variants of geom_density in ggplot:
Can someone please explain the difference between these four calls:
geom_density(aes_string(x=myvar))
geom_density(aes_string(x=myvar, y=..density..))
geom_density(aes_string(x=myvar, y=..scaled..))
geom_density(aes_string(x=myvar, y=..count../sum(..count..)))
My understanding is that:
geom_density alone will produce a density whose area under the curve sums to 1
geom_density with ..density.. basically does the same... ?
the ..count../sum(..count..) will normalize the peak heights to be more like a normalized histogram, ensuring that all the heights sum to 1
the ..count.. by itself without the denominator will just multiply each bin by # of items in it
the ..scaled.. parameter will make it so the maximum value of the density is 1.
I find ..scaled.. very counterintuitive and, if my interpretation of it is correct, have never seen it used, so I'd like to ignore it. I am mainly looking for clarification of the differences between geom_density and a kind of normalized density plot, which I assume requires the ..count../... argument. Thanks.
(Related: Error with ggplot2 mapping variable to y and using stat="bin")
The default aesthetic for stat_density is ..density.., so a call to geom_density which uses stat_density by default, will plot y = ..density.. by default.
You can see how the various columns are calculated by looking at the source code.
..scaled.. is defined as
densdf$scaled <- densdf$y / max(densdf$y, na.rm = TRUE)
Feel free to ignore it if you wish.
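To make the stat_density columns concrete, here is a minimal base-R sketch that rebuilds them by hand from density(). It assumes stat_density amounts to a plain density() call on a 512-point grid; the names dens_df, prop and area_density are made up for illustration.

```r
set.seed(1)
x <- rnorm(200)

# Kernel density estimate on an evenly spaced grid (512 points by default)
d <- density(x)

dens_df <- data.frame(
  x       = d$x,
  density = d$y,                # area under the curve is approximately 1
  scaled  = d$y / max(d$y),     # peak height is exactly 1
  count   = d$y * length(x)     # area is approximately the number of observations
)

# The ..count../sum(..count..) variant: heights sum to 1 over the grid points,
# so the values depend on how many grid points there are, not on an area
dens_df$prop <- dens_df$count / sum(dens_df$count)

# Riemann-sum check of the area under the density column
step <- diff(d$x)[1]
area_density <- sum(dens_df$density) * step
```

Under these assumptions, ..count../sum(..count..) is just a rescaled copy of ..density.. whose heights change with the grid resolution, which is one reason it is harder to interpret than the other three.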
Looking at the source code for stat_bin
The results are computed as such
res <- within(results, {
count[is.na(count)] <- 0
density <- count / width / sum(abs(count), na.rm=TRUE)
ncount <- count / max(abs(count), na.rm=TRUE)
ndensity <- density / max(abs(density), na.rm=TRUE)
})
So if you want to compare the results of geom_histogram (using the default stat = 'bin'), then you can set y = ..density.. and it will calculate count / sum(count) for you (accounting for the width of the bins)
If you wanted to compare geom_density(aes(y=..scaled..)) with stat_bin, then you would use geom_histogram(aes(y = ..ndensity..))
You could get them on the same scale by using ..count.. in both as well, however you would need to adjust the adjust parameter in stat_density to get the appropriately detailed approximation of the curve.
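The stat_bin formulas quoted above can be checked outside ggplot2. This is only a sketch: it uses base R's hist() to do the binning (which chooses breaks differently from stat_bin), and the data frame name bins is made up for illustration.

```r
set.seed(1)
x <- rnorm(200)

# Bin the data without drawing anything
h <- hist(x, breaks = 10, plot = FALSE)

bins <- data.frame(
  count = h$counts,
  width = diff(h$breaks)
)

# Same arithmetic as the stat_bin source quoted above
bins$density  <- bins$count / bins$width / sum(abs(bins$count))
bins$ncount   <- bins$count / max(abs(bins$count))
bins$ndensity <- bins$density / max(abs(bins$density))

# Total bar area for the density column
total_area <- sum(bins$density * bins$width)
```

Here density is the only column whose bar areas add up to 1; ncount and ndensity are both rescaled so that the tallest bar has height 1.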
Related
I am trying to create a joyplot using the ggridges package (based on ggplot2). The general idea is that a joyplot creates nicely scaled stacked density plots. However, I cannot seem to produce one of these using weighted density. Is there some way of incorporating sampling weights (for weighted density) in the calculation of the densities in the creation of a joyplot?
Here's a link to the documentation for the ggridges package: https://cran.r-project.org/web/packages/ggridges/ggridges.pdf I know a lot of packages based on ggplot can accept additional aesthetics, but I don't know how to add weights to this type of geom object.
Additionally, here is an example of an unweighted joyplot in ggplot. I am trying to convert this to a weighted plot with the density weighted according to pweight.
# Load package, set seed
library(ggplot2)
library(ggridges)
set.seed(1)
# Create an example dataset
dat <- data.frame(group = c(rep("A",100), rep("B",100)),
pweight = runif(200),
val = runif(200))
# Create an example of an unweighted joyplot
ggplot(dat, aes(x = val, y = group)) + geom_density_ridges(scale= 0.95)
It looks like the way to do this is to use stat_density rather than the default stat_density_ridges. Per the docs you linked to:
Note that the default stat_density_ridges makes joint density
estimation across all datasets. This may not generate the desired
result when using faceted plots. As an alternative, you can set
stat = "density" to use stat_density. In this case, it is required
to add the aesthetic mapping height = ..density.. (see examples).
Fortunately, stat_density (unlike stat_density_ridges) understands the aesthetic weight and will pass it to the underlying density call. You end up with something like:
ggplot(dat, aes(x = val, y = group)) +
geom_density_ridges(aes(height=..density.., # Notice the additional
weight=pweight), # aes mappings
scale= 0.95,
stat="density") # and use of stat_density
The ..density.. variable is automatically generated by stat_density.
Note: It appears that when you use stat_density the x-axis range behaves a little differently: it will trim the density plot to the data range and drop the nice-looking tails. You can easily correct this by manually expanding your x-axis, but I thought it was worth mentioning.
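To see roughly what the weight aesthetic does, here is a base-R sketch of a weighted density per group. It assumes the weights are normalized to sum to 1 within each group before being handed to density(), which is my reading rather than something stated above; weighted_density is a hypothetical helper, and the bandwidth is taken from the unweighted values.

```r
set.seed(1)
dat <- data.frame(group = c(rep("A", 100), rep("B", 100)),
                  pweight = runif(200),
                  val = runif(200))

# Hypothetical helper: weighted kernel density for one group
weighted_density <- function(df) {
  w <- df$pweight / sum(df$pweight)   # normalize so the weights sum to 1
  density(df$val,
          weights = w,
          bw = bw.nrd0(df$val))       # bandwidth from the unweighted values (an assumption)
}

dens_by_group <- lapply(split(dat, dat$group), weighted_density)

# Each weighted curve should still integrate to roughly 1
areas <- sapply(dens_by_group, function(d) sum(d$y) * diff(d$x)[1])
```

Comparing these curves with the output of density() without weights is a quick way to check how much the sampling weights change the shape.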
How can I plot the relative proportions of two groups using a fill aesthetic in ggplot2?
I am asking this question here because several other answers on this topic seem incorrect (ex1, ex2, and ex3), but Cross Validated seems to have functionally banned R specific questions (CV meta). ..density.. is conceptually related to, but distinct from proportions (ex4 and ex5). So the correct answer does not seem to involve density.
Example:
set.seed(1200)
test <- data.frame(
test1 = factor(sample(letters[1:2], 100, replace = TRUE,prob=c(.25,.75)),ordered=TRUE,levels=letters[1:2]),
test2 = factor(sample(letters[3:8], 100, replace = TRUE),ordered=TRUE,levels=letters[3:8])
)
ggplot(test, aes(test2)) + geom_bar(aes(y = ..density.., group=test1, fill=test1) ,position="dodge")
#For example, the plotted data shows level a x c as being slightly in excess of .15, but a manual calculation shows a value of .138
counts <- with(test,table(test1,test2))
counts/matrix(rowSums(counts),nrow=2,ncol=6)
The answer that seems to yield an output that is correct resorts to a solution that doesn't use ggplot2 (calculating it outside of ggplot2) or requires that a panel be used rather than a fill aesthetic.
Edit: Digging into stat_bin yields that the function ultimately called is bin, but bin only gets passed the values in the x aes. Without rewriting stat_bin (or making another stat_) the hack that was applied in the above referenced answer can be generalized to the fill aes in the absence of the group aes with the following code for the y aes: y = ..count../sapply(fill, FUN=function(x) sum(count[fill == x])). This just replaces PANEL (the hidden column that is present at the end of StatBin) with fill. Presumably other hidden variables could get the same treatment.
This is an awful hack, but it seems to do what you want...
ggplot(test, aes(test2)) + geom_bar(aes(y = ..count../rep(c(sum(..count..[1:6]), sum(..count..[7:12])), each=6),
group=test1, fill=test1) ,position="dodge") +
scale_y_continuous(name="proportion")
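For comparison, the "calculate it outside ggplot2" route mentioned in the question can be written compactly with prop.table. This is a sketch using the question's example data; props and props_df are names introduced here for illustration.

```r
set.seed(1200)
test <- data.frame(
  test1 = factor(sample(letters[1:2], 100, replace = TRUE, prob = c(.25, .75)),
                 ordered = TRUE, levels = letters[1:2]),
  test2 = factor(sample(letters[3:8], 100, replace = TRUE),
                 ordered = TRUE, levels = letters[3:8])
)

counts <- with(test, table(test1, test2))

# Proportion of each test2 level within each test1 group (rows sum to 1)
props <- prop.table(counts, margin = 1)

# Long format with one row per test1 x test2 combination, ready for plotting
props_df <- as.data.frame(props, responseName = "proportion")
```

props_df could then be plotted with pre-computed heights, for example ggplot(props_df, aes(test2, proportion, fill = test1)) + geom_bar(stat = "identity", position = "dodge"), which avoids relying on hidden columns.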
I am using ggplot2 to make a histogram:
geom_histogram(aes(x=...), y="..ncount../sum(..ncount..)")
and I get the error:
Mapping a variable to y and also using stat="bin".
With stat="bin", it will attempt to set the y value to the count of cases in each group.
This can result in unexpected behavior and will not be allowed in a future version of ggplot2.
If you want y to represent counts of cases, use stat="bin" and don't map a variable to y.
If you want y to represent values in the data, use stat="identity".
See ?geom_bar for examples. (Deprecated; last used in version 0.9.2)
What causes this in general? I am confused about the error because I'm not mapping a variable to y, just histogram-ing x and would like the height of the histogram bar to represent a normalized fraction of the data (such that all the bar heights together sum to 100% of the data.)
edit: if I want to make a density plot geom_density instead of geom_histogram, do I use ..ncount../sum(..ncount..) or ..scaled..? I'm unclear about what ..scaled.. does.
The confusion here is a long standing one (as evidenced by the verbose warning message) that all starts with stat_bin.
But users don't typically realize that their confusion revolves around stat_bin, since they typically encounter problems while using either geom_bar or geom_histogram. Note the documentation for each: they both use stat = "bin" (in current ggplot2 versions this stat has been split into stat_bin for continuous data and stat_count for discrete data) by default.
But let's back up. geom_*'s control the actual rendering of data into some sort of geometric form. stat_*'s simply transform your data. The distinction is a bit confusing in practice, because adding a layer of stat_bin will, by default, invoke geom_bar and so it can seem indistinguishable from geom_bar when you're learning.
In any case, consider the "bar"-like geom's: histograms and bar charts. Both are clearly going to involve some binning of data somewhere along the line. But our data could either be pre-summarised or not. For instance, we might want a bar plot from:
x
a
a
a
b
b
b
or equivalently from
x y
a 3
b 3
The first hasn't been binned yet. The second is pre-binned. The default behavior for both geom_bar and geom_histogram is to assume that you have not pre-binned your data. So they will attempt to call stat_bin (for histograms, now stat_count for bar charts) on your x values.
As the warning says, it will then try to map y for you to the resulting counts. If you also attempt to map y yourself to some other variable you end up in Here There Be Dragons territory. Mapping y to functions of the variables returned by stat_bin (..count.., etc.) should be ok and should not throw that warning (it doesn't for me using @mnel's example above).
The take-away here is that for geom_bar if you've pre-computed the heights of the bars, always remember to use stat = "identity", or better yet use the newer geom_col which uses stat = "identity" by default. For geom_histogram it's very unlikely that you will have pre-computed the bins, so in most cases you just need to remember not to map y to anything beyond what's returned from stat_bin.
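A small sketch of the two cases, using the toy data from above. The data frame names raw and binned are made up here, and the plotting part only runs if ggplot2 is installed.

```r
# Not yet binned: one row per observation
raw <- data.frame(x = c("a", "a", "a", "b", "b", "b"))

# Pre-binned: one row per category, with the count in y
binned <- as.data.frame(table(x = raw$x), responseName = "y")

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)

  # Raw data: let the stat do the counting, do not map y
  p_raw <- ggplot(raw, aes(x)) + geom_bar()

  # Pre-binned data: heights are already known, so use the identity stat
  p_binned <- ggplot(binned, aes(x, y)) + geom_col()
}
```

Both plots should show two bars of height 3; the difference is only in who does the counting.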
geom_dotplot uses its own binning stat, stat_bindot, and this discussion applies here as well, I believe. This sort of thing generally hasn't been an issue with the 2d binning cases (geom_bin2d and geom_hex) since there hasn't been as much flexibility available in the analogous z variable to the binned y variable in the 1d case. If future updates start allowing more fancy manipulations of the 2d binning cases this could I suppose become something you have to watch out for there.
The documentation for geom_histogram states that it is an alias for stat_bin and geom_bar
The documentation for geom_density states that it uses a smooth density estimate produced using stat_density
Following the links (or finding the help pages directly)
stat_bin
The documentation for stat_bin describes how stat_bin returns a data.frame with the following (additional) columns
count number of points in bin
density density of points in bin, scaled to integrate to 1
ncount count, scaled to maximum of 1
ndensity density, scaled to maximum of 1
stat_density
The documentation for stat_density describes how stat_density returns a data.frame with the following (additional) columns
density density estimate
count density * number of points - useful for stacked density plots
scaled density estimate, scaled to maximum of 1
To produce a plot on the same scale it would appear that you want ..ndensity.. from stat_bin and ..scaled.. from stat_density or ..density.. from both
ggplot(dd, aes(x=x)) +
geom_histogram(aes(y= ..density..)) +
geom_density(aes(y=..density..))
ggplot(dd, aes(x=x)) +
geom_histogram(aes(y= ..ndensity..)) +
geom_density(aes(y=..scaled..))
I'm trying to make a boxplot with ggplot2 using the following code:
p <- ggplot(
data,
aes(d$score, reorder(d$names d$scores, median))
) +
geom_boxplot()
I have factors called names and integers called scores.
My code produces a plot, but the graphic does not depict the boxes (only shows lines) and I get a warning message, "position_dodge requires non-overlapping x intervals." I've tried to adjust the height and width with geom_boxplot(width=5), but this does not seem to fix the problem. Can anyone suggest a possible solution to my problem?
I should point out that my boxplot is rather large and has about 200 name values on the y-axis. Perhaps this is the problem?
The number of groups is not the problem; I can see the same thing even when there are only 2 groups. The issue is that ggplot2 draws boxplots vertically (continuous along y, categorical along x) and you are trying to draw them horizontally (continuous along x, categorical along y).
Also, your example has several syntax errors and isn't reproducible because we don't have data/d.
Start with some mock data
dat <- data.frame(scores=rnorm(1000,sd=500),
names=sample(LETTERS, 1000, replace=TRUE))
Corrected version of your example code:
ggplot(dat, aes(scores, reorder(names, scores, median))) + geom_boxplot()
These are the horizontal lines you saw.
If you instead put the categorical on the x axis and the continuous on the y you get
ggplot(dat, aes(reorder(names, scores, median), scores)) + geom_boxplot()
Finally, if you want to flip the coordinate axes, you can use coord_flip(). There can be some additional problems with this if you are doing even more sophisticated things, but for basic boxplots it works.
ggplot(dat, aes(reorder(names, scores, median), scores)) +
geom_boxplot() + coord_flip()
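If it is unclear what reorder(names, scores, median) is doing in these calls, this base-R sketch shows the effect on the factor levels. It uses the same kind of mock data; the seed and the names ordered_names and meds are introduced here for illustration.

```r
set.seed(42)
dat <- data.frame(scores = rnorm(1000, sd = 500),
                  names = sample(LETTERS, 1000, replace = TRUE))

# Reorder the levels of names by the median score within each level
ordered_names <- reorder(dat$names, dat$scores, median)

# Medians listed in the new level order come out in ascending order
meds <- tapply(dat$scores, ordered_names, median)
```

The boxes are drawn in level order, so after reordering they appear sorted by median along the categorical axis.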
In case anyone else arrives here wondering why they're seeing
Warning message:
position_dodge requires non-overlapping x intervals
Why this happens
This happens because some of the boxplots or violin plots (or whatever geom is being dodged) may overlap. In many cases you may not care, but in some cases it matters, which is why ggplot2 warns you.
How to fix it
You have two options. Either suppress warnings when generating/printing the ggplot
The other option is simply to alter the width of the boxes so that they don't overlap; the warning then goes away. Try altering the width argument to the geom, e.g. geom_boxplot(width = 0.5) (the same works for geom_violin()).
In addition to @stevec's options, if you're seeing
position_stack requires non-overlapping x intervals
position_fill requires non-overlapping x intervals
position_dodge requires non-overlapping x intervals
position_dodge2 requires non-overlapping x intervals
and if your x variable is supposed to overlap for different aesthetics such as fill, you can try making the x_var into a factor:
geom_bar(aes(x = factor(x_var), fill = type))
Is there a way to create a boxplot in R that will display with the box (somewhere) an "N=(sample size)"? The varwidth logical adjusts the width of the box on the basis of sample size, but that doesn't allow comparisons between different plots.
FWIW, I am using the boxplot command in the following fashion, where 'f1' is a factor:
boxplot(xvar ~ f1, data=frame, xlab="input values", horizontal=TRUE)
Here's some ggplot2 code. It's going to display the sample size at the sample mean, making the label multifunctional!
First, a simple function for fun.data
give.n <- function(x){
return(c(y = mean(x), label = length(x)))
}
Now, to demonstrate with the diamonds data
ggplot(diamonds, aes(cut, price)) +
geom_boxplot() +
stat_summary(fun.data = give.n, geom = "text")
You may have to play with the text size to make it look good, but now you have a label for the sample size which also gives a sense of the skew.
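To see what stat_summary gets back from give.n for each group, here is a base-R sketch that applies the same function per group by hand. It uses the built-in mtcars data rather than diamonds so that it runs without ggplot2; per_group is a name made up here.

```r
give.n <- function(x) {
  return(c(y = mean(x), label = length(x)))
}

# Apply give.n to mpg within each cyl group, as stat_summary would per x value
per_group <- sapply(split(mtcars$mpg, mtcars$cyl), give.n)

# Row "y" is where the label is drawn, row "label" is the text shown
per_group
```

Swapping mean(x) for median(x) or max(x) inside give.n would move the label to the median line or to the largest observation instead.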
You can use the names parameter to write the n next to each factor name.
If you don't want to calculate the n yourself you could use this little trick:
# Do the boxplot but do not show it
b <- boxplot(xvar ~ f1, data=frame, plot=0)
# Now b$n holds the counts for each factor, we're going to write them in names
boxplot(xvar ~ f1, data=frame, xlab="input values", names=paste(b$names, "(n=", b$n, ")"))
To get the n on top of the bar, you could use text with the stat details provided by boxplot as follows
b <- boxplot(xvar ~ f1, data=frame, plot=0)
text(1:length(b$n), b$stats[5,]+1, paste("n=", b$n))
The stats field of b is
a matrix, each column contains the extreme of the lower whisker, the lower hinge, the median, the upper hinge and the extreme of the upper whisker for one group/plot.
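The layout of b$stats is easier to see with row names attached. This sketch uses the built-in mtcars data in place of the question's frame; the row labels and the label_pos data frame are added here for illustration only.

```r
# Compute the boxplot statistics without drawing
b <- boxplot(mpg ~ cyl, data = mtcars, plot = FALSE)

# Label the five rows of the stats matrix (one column per group)
rownames(b$stats) <- c("lower whisker", "lower hinge", "median",
                       "upper hinge", "upper whisker")

# Where to put each "n=" label: just above the upper whisker
label_pos <- data.frame(
  group = b$names,
  n     = b$n,
  y     = b$stats["upper whisker", ] + 1
)
```

After drawing the boxplot, text(seq_along(b$n), label_pos$y, paste0("n=", label_pos$n)) places the counts above the whiskers. Outliers beyond the whiskers are stored in b$out and are not accounted for here, so labels may collide with them.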
The gplots package provides boxplot.n, which according to the documentation produces a boxplot annotated with the number of observations.
I figured out a workaround using the EnvStats package. This package needs to be installed and then loaded using:
library(EnvStats)
The stripChart (different from stripchart) does add to the chart some values such as the n values. First I plotted my boxplot. Then I used the add=T in the stripChart. Obviously, many things were hidden in the stripChart code so that they do not show up on the boxplot. Here is the code I used for the stripChart to hide most items.
Boxplot with integrated stripChart to show n values:
stripChart(data.frame(T0_G1,T24h_G1,T96h_G1,T7d_G1,T11d_G1,T15d_G1,T30d_G1), show.ci=F,axes=F,points.cex=0,n.text.line=1.6,n.text.cex=0.7,add=T,location.scale.text="none")
So boxplot
boxplot(data.frame(T0_G1,T24h_G1,T96h_G1,T7d_G1,T11d_G1,T15d_G1,T30d_G1),main="All Rheometry Tests on Egg Plasma at All Time Points at 0.1Hz,0.1% and 37 Set 1,2,3", names=c("0h","24h","96h","7d ", "11d", "15d", "30d"),boxwex=0.6,par(mar=c(8,4,4,2)))
Then stripChart
stripChart(data.frame(T0_G1,T24h_G1,T96h_G1,T7d_G1,T11d_G1,T15d_G1,T30d_G1), show.ci=F,axes=F,points.cex=0,n.text.line=1.6,n.text.cex=0.7,add=T,location.scale.text="none")
You can always adjust the height of the numbers (n values) so that they fit where you want.