geom_density blind in terms of the aesthetics supplied? - r

I have to admit that it has been a while since I used ggplot, but this seems a bit silly. Either I am missing something fundamental when trying to make a density plot, or there is a bug in ggplot2 (v3.3.2).
test <- data.frame(Time=rnorm(100),Age=rnorm(100))
ggplot(test,aes(y=Time,x=Age)) +
geom_density(aes(y=Time,x=Age))
produces
Error: geom_density requires the following missing aesthetics: y
How could the 'y' aesthetic be missing?

There are two cases when using geom_density(), depending on which stat the layer uses:
The standard case is stat = "density", where geom_density() computes its y values from the distribution of the given x values. In this case you must NOT provide a y aesthetic, because the y values are computed behind the scenes.
The second case, which is yours, has to be specified explicitly by changing the stat to identity. This is needed if, for some reason, you have precalculated values that you want to feed directly into geom_density().
Your problem arises when you mix cases 1 and 2. But I agree that the error message is not very clear; it could remind you to check that the stat in use is the desired one.
library(ggplot2)
test <- data.frame(time = rnorm(100), age = rnorm(100))
#if you want to use precalculated y values you have to change the used stat to identity:
ggplot(test) +
geom_density(aes(x = age, y = time),
stat = "identity")
# compared to the case with the default value of stat: stat = "density"
ggplot(test) +
geom_density(aes(x = age))
Created on 2020-08-04 by the reprex package (v0.3.0)
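To make the second case concrete, here is a minimal sketch of what "precalculated values" could look like. It assumes the density is computed beforehand with base R's density(); the data frame pre and its column dens are names invented for this illustration, not part of the original answer.

```r
library(ggplot2)

set.seed(1)
test <- data.frame(age = rnorm(100))

# Precompute the kernel density estimate outside of ggplot2
d <- density(test$age)
pre <- data.frame(age = d$x, dens = d$y)  # hypothetical helper data frame

# stat = "identity" tells geom_density() to draw the supplied y values as-is
p <- ggplot(pre, aes(x = age, y = dens)) +
  geom_density(stat = "identity")
```

Printing p should draw a curve of the same kind that geom_density(aes(x = age)) computes on its own from the raw data.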

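If you want to plot both variables in the same graphic, you need to "melt" the data first.
library(data.table)
library(ggplot2)
test <- data.frame(Time=rnorm(100),Age=rnorm(100))
dt <- data.table(test)
# reshape to long format: one row per (variable, value) pair
dt_melt <- melt(dt, measure.vars = c("Time", "Age"))
ggplot(dt_melt,aes(x=value, fill=variable)) + geom_density(alpha=0.25)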

Related

Frequency count histogram displaying only integer values on the y-axis?

I'd much appreciate anyone's help to resolve this question please. It seems like it should be so simple, but after many hours experimenting, I've had to stop in and ask for help. Thank you very much in advance!
Summary of question:
How can one ensure in ggplot2 the y-axis of a histogram is labelled using only integers (frequency count values) and not decimals?
The functions, arguments and datatype changes tried so far include:
geom_histogram(), geom_bar() and geom_col() - in each case, including, or not, the argument stat = "identity" where relevant.
adding + scale_y_discrete(), with or without + scale_x_discrete()
converting the underlying count data to a factor and/or the bin data to a factor
Ideally, the solution would use base R or ggplot2 rather than additional external dependencies, e.g. the pretty_breaks() function in the scales package, or similar.
Sample data:
sample <- data.frame(binMidPts = c(4500,5500,6500,7500), counts = c(8,0,9,3))
The x-axis consists of bins of a continuous variable, and the y-axis is intended to show the count of observations in those bins. For example, Bin 1 covers the x-axis range [4000 <= x < 5000], has a mid-point 4500, with 8 data points observed in that bin / range.
Code that almost works:
The following code generates a graph similar to the one I'm seeking, however the y-axis is labelled with decimal values on the breaks (which aren't valid as the data are integer count values).
ggplot(data = sample, aes (x = binMidPts, y = counts)) + geom_col()
Graph produced by this code is:
I realise I could hard-code the breaks / labels onto a scale_y_continuous() axis but (a) I'd prefer a flexible solution to apply to many differently sized datasets where the scale isn't known in advance, and (b) I expect there must be a simpler way to generate a basic histogram.
References
I've consulted many Stack Overflow questions, the ggplot2 manual (https://ggplot2.tidyverse.org/reference/scale_discrete.html), the sthda.com examples and various blogs. These tend to address related problems, e.g. using scale_y_continuous, or where count data is not available in the underlying dataset and thus rely on stat_bin() for a transformation.
Any help would be much appreciated! Thank you.
// Update 1 - Extending scale to zero
Future readers of this thread may find it helpful to know that the range of break values formed by base::pretty() does not necessarily extend to zero. Thus, the axis scale may omit values between zero and the lower range of the breaks, as shown here:
To resolve this, I included '0' in the range() parameter, i.e.:
ggplot(data = sample, aes (x = binMidPts, y = counts)) + geom_col() +
scale_y_continuous(breaks=round(pretty(range(0,sample$counts))))
which gives the desired full scale on the y-axis, thus:
How about:
ggplot(data = sample, aes (x = binMidPts, y = counts)) + geom_col() +
scale_y_continuous( breaks=round(pretty( range(sample$counts) )) )
This answer suggests pretty_breaks from the scales package. The manual page of pretty_breaks mentions pretty from base. And from there you just have to round it to the nearest integer.
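One caveat with rounding: for small ranges, round(pretty(...)) can yield duplicated breaks. A possible variant, sketched here under the assumption that breaks may be supplied as a function of the axis limits (which scale_y_continuous() accepts), keeps only the whole-number values that pretty() proposes. The name int_breaks is made up for this example.

```r
library(ggplot2)

# Hypothetical helper: keep only the integer-valued breaks from base::pretty()
int_breaks <- function(limits) {
  b <- pretty(limits)
  b <- b[abs(b - round(b)) < 1e-8]  # drop fractional candidates
  unique(round(b))
}

sample <- data.frame(binMidPts = c(4500, 5500, 6500, 7500),
                     counts = c(8, 0, 9, 3))

p <- ggplot(data = sample, aes(x = binMidPts, y = counts)) +
  geom_col() +
  scale_y_continuous(breaks = int_breaks)
```

Because the function is evaluated on the actual axis limits, it does not need to know the data range in advance.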
The default y-axis breaks are calculated with scales::extended_breaks(). This function factory has a ... argument that passes arguments on to labeling::extended, which has a Q argument for what it considers 'nice numbers'. If you omit the 2.5 from the default, you should get integer breaks when the range is 3 or larger.
library(ggplot2)
library(scales)
sample <- data.frame(binMidPts = c(4500,5500,6500,7500), counts = c(8,0,9,3))
ggplot(data = sample, aes (x = binMidPts, y = counts)) +
geom_col() +
scale_y_continuous(
breaks = extended_breaks(Q = c(1, 5, 2, 4, 3))
)
Created on 2021-04-28 by the reprex package (v1.0.0)
Or you can calculate the breaks with rules customized to the dataset you are working with, like this:
library(ggplot2)
breaks_min <- 0
breaks_max <- max(sample[["counts"]])
# Assume about 5 breaks is preferable; force a step of at least 1 so seq() never gets a zero step
breaks_bin <- max(1, round((breaks_max - breaks_min) / 5))
custom_breaks <- seq(breaks_min, breaks_max, breaks_bin)
ggplot(data = sample, aes (x = binMidPts, y = counts)) +
geom_col() +
scale_y_continuous(breaks = custom_breaks, expand = c(0, 0))
Created on 2021-04-28 by the reprex package (v2.0.0)

r - scatterplot summary stat (e.g. sum or mean) for each point instead of individual data points

I am looking for a way to summarize data within a ggplot call, not before. I could pre-aggregate the data and then plot it, but I know there is a way to do it within a ggplot call. I'm just unsure how.
In this example, I want to get a mean for each (x,y) combo, and map it onto the colour aes
library(tidyverse)
df <- tibble(x = rep(c(1,2,4,1,5),10),
y = rep(c(1,2,3,1,5),10),
col = sample(c(1:100), 50))
df_summar <- df %>%
group_by(x,y) %>%
summarise(col_mean = mean(col))
ggplot(df_summar, aes(x=x, y=y, col=col_mean)) +
geom_point(size = 5)
I think there must be a better way to avoid the pre-ggplot step (yes, I could also have piped dplyr transformations into the ggplot, but the mechanics would be the same).
For instance, geom_count() counts the instances and plots them onto size aes:
ggplot(df, aes(x=x, y=y)) + geom_count()
I want the same, but mean instead of count, and col instead of size
I'm guessing I need stat_summary() or a stat() call (a replacement for ..xxx.. notation), but I can't get it to give me what I need.
You'll need stat_summary_2d:
ggplot(df, aes(x, y, z = col)) +
stat_summary_2d(aes(col = ..value..), fun = 'mean', geom = 'point', size = 5)
(Or calc(value), if you use the ggplot dev version, or read this in the future. Released versions from ggplot2 3.3.0 onward spell this after_stat(value).)
You can pass any arbitrary function to fun.
While stat_summary seems like it would be useful, it is not in this case. It is specialized in the common transformation for plotting, summarizing a range of y values, grouped by x, into a set of summary statistics that are plotted as y(, ymin and ymax). You want to group by both x and y, so 2d it is.
Note that this uses binning, however, so to get the points to line up accurately you need to increase the number of bins (e.g. bins = 1e3). Unfortunately, there is no non-binning 2d summary stat.
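Putting the binning note into practice, a sketch might look as follows. It assumes ggplot2 3.3.0 or later (for after_stat()) and passes a large bins value so that each distinct (x, y) pair falls into its own bin. The aggregate() call is only there as a pre-aggregated cross-check and is not needed for the plot.

```r
library(ggplot2)

set.seed(42)
df <- data.frame(
  x = rep(c(1, 2, 4, 1, 5), 10),
  y = rep(c(1, 2, 3, 1, 5), 10),
  col = sample(1:100, 50)
)

# Summarise z (= col) per 2d bin and map the result onto colour
p <- ggplot(df, aes(x, y, z = col)) +
  stat_summary_2d(
    aes(colour = after_stat(value)),
    fun = "mean", geom = "point", size = 5,
    bins = 1e3  # many narrow bins, so points sit close to the true x/y values
  )

# Cross-check: the same means computed before plotting
check <- aggregate(col ~ x + y, data = df, FUN = mean)
```

With this many bins, the plotted points should be within a fraction of a bin width of the original coordinates.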

How do I create a barplot in R with a cumulative standard deviation?

I want to make a plot similar to the one attached by Lindfield et al. 2016. I'm familiar with the ggplot command in R with the format:
ggplot(dataframe, aes(x, y)) + geom_bar(stat = 'identity')
However, I don't know how to make a cumulative se error for a stacked barplot; only one that employs a position_dodge command.
I know that there are disadvantages to using stacked bars with se errors, but for my data set, it is more presentable than using the unstacked barplots.
Thanks.
I don't know how you would get the cumulative standard errors in an appropriate way (I guess it depends on how your values are generated), but I think you need to calculate them and store them in a second DF. For example, if you have an initial data.frame created like this:
DF <- data.frame( x=c("a","a","b","b"),
sp=c("shark","cod","shark","cod"),
y=c(10,5,15,7),
stringsAsFactors=FALSE )
where y is the value associated with each species at each x point, then you'd create a second DF containing the lower and upper limits of your s.e. for each x value, eg
seDF <- data.frame( x=c('a','b'),
yl=c(12,18),
yu=c(17,24),
stringsAsFactors=FALSE )
Then you can create your plot with:
ggplot() +
geom_bar( data=DF, mapping=aes(x=x,y=y,fill=sp),
position="stack", stat="identity") +
geom_linerange( data=seDF, mapping=aes(x=x, ymin=yl, ymax=yu) )
I used geom_linerange rather than geom_errorbar as it doesn't create crossbars at either end.

What does ..count.. mean in R? [duplicate]

Consider the following lines.
p <- ggplot(mpg, aes(x=factor(cyl), y=..count..))
p + geom_histogram()
p + stat_summary(fun.y=identity, geom='bar')
In theory, the last two should produce the same plot. In practice, stat_summary fails and complains that the required y aesthetic is missing.
Why can't I use ..count.. in stat_summary? I can't find anywhere in the docs information about how to use these variables.
Expanding on @joran's comment, the special variables in ggplot with double periods around them (..count.., ..density.., etc.) are returned by a stat transformation of the original data set. Those particular ones are returned by stat_bin, which is implicitly called by geom_histogram (note in the documentation that the default value of the stat argument is "bin"). Your second example calls a different stat function, which does not create a ..count.. variable. You can get the same graph with
p + geom_bar(stat="bin")
In newer versions of ggplot2, one can also use the stat function instead of the enclosing .., so aes(y = ..count..) becomes aes(y = stat(count)).
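For readers on a recent release, here is a brief sketch of the same idea. It assumes ggplot2 3.3.0 or later, where after_stat() is the documented way to refer to a computed variable, and it uses geom_bar(), whose default stat_count handles a discrete x.

```r
library(ggplot2)

# count is computed by stat_count, the default stat of geom_bar()
p <- ggplot(mpg, aes(x = factor(cyl))) +
  geom_bar(aes(y = after_stat(count)))

# layer_data() exposes the variables the stat computed for this layer
computed <- layer_data(p)
```

Inspecting computed shows a count column, which is what after_stat(count) refers to inside aes().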

How can I plot the relative proportions of two groups using a fill aesthetic in ggplot2?

I am asking this question here because several other answers on this topic seem incorrect (ex1, ex2, and ex3), but Cross Validated seems to have functionally banned R specific questions (CV meta). ..density.. is conceptually related to, but distinct from proportions (ex4 and ex5). So the correct answer does not seem to involve density.
Example:
set.seed(1200)
test <- data.frame(
test1 = factor(sample(letters[1:2], 100, replace = TRUE,prob=c(.25,.75)),ordered=TRUE,levels=letters[1:2]),
test2 = factor(sample(letters[3:8], 100, replace = TRUE),ordered=TRUE,levels=letters[3:8])
)
ggplot(test, aes(test2)) + geom_bar(aes(y = ..density.., group=test1, fill=test1) ,position="dodge")
#For example, the plotted data shows level a x c as being slightly in excess of .15, but a manual calculation shows a value of .138
counts <- with(test,table(test1,test2))
counts/matrix(rowSums(counts),nrow=2,ncol=6)
The answer that seems to yield an output that is correct resorts to a solution that doesn't use ggplot2 (calculating it outside of ggplot2) or requires that a panel be used rather than a fill aesthetic.
Edit: Digging into stat_bin shows that the function ultimately called is bin, but bin only gets passed the values in the x aes. Without rewriting stat_bin (or making another stat_), the hack that was applied in the above referenced answer can be generalized to the fill aes in the absence of the group aes with the following code for the y aes: y = ..count../sapply(fill, FUN=function(x) sum(count[fill == x])). This just replaces PANEL (the hidden column that is present at the end of StatBin) with fill. Presumably other hidden variables could get the same treatment.
This is an awful hack, but it seems to do what you want...
ggplot(test, aes(test2)) + geom_bar(aes(y = ..count../rep(c(sum(..count..[1:6]), sum(..count..[7:12])), each=6),
group=test1, fill=test1) ,position="dodge") +
scale_y_continuous(name="proportion")
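As an alternative to indexing ..count.. by hand, a sketch using the prop variable that stat_count computes per group may be simpler. This is a different technique from the hack above and assumes ggplot2 3.3.0 or later (for after_stat()). The factors are left unordered here for brevity, and manual is a name introduced only for the cross-check.

```r
library(ggplot2)

set.seed(1200)
test <- data.frame(
  test1 = factor(sample(letters[1:2], 100, replace = TRUE, prob = c(.25, .75))),
  test2 = factor(sample(letters[3:8], 100, replace = TRUE))
)

# prop is the within-group proportion; group = test1 makes it sum to 1 per fill level
p <- ggplot(test, aes(test2)) +
  geom_bar(aes(y = after_stat(prop), group = test1, fill = test1),
           position = "dodge") +
  scale_y_continuous(name = "proportion")

# Manual row-wise proportions to compare against the plotted bars
manual <- prop.table(table(test$test1, test$test2), margin = 1)
```

The bar heights should match the rows of manual, since both divide each cell count by the total for that level of test1.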

Resources