What is the difference (if any) between geom_bar and geom_histogram in ggplot? They seem to produce the same plot and take the same parameters.
Bar charts provide a visual presentation of categorical data. Examples:
The number of people with red, black and brown hair
Look at the geom_bar help file. The examples are all counts.
Wikipedia page
Histograms are used to plot the density of interval (usually numeric) data. Examples:
Distributions of age and height
Look at the geom_histogram help file. The examples show the distribution of movie ratings.
ggplot2
After a bit more investigating, I think in ggplot2 there is no difference between geom_bar and geom_histogram. From the docs:
geom_histogram(mapping = NULL, data = NULL, stat = "bin",
position = "stack", ...)
geom_bar(mapping = NULL, data = NULL, stat = "bin",
position = "stack", ...)
I realise that in the geom_histogram docs it states:
geom_histogram is an alias for geom_bar plus stat_bin
but to be honest, I'm not really sure what this means, since my understanding of ggplot2 is that both stat_bin and geom_bar are layers (with a slightly different emphasis).
The default behavior is the same for both geom_bar and geom_histogram. This is because (as @csgillespie mentioned) there is an implied stat_bin when you call geom_histogram (understandable), and it is also the default statistical transformation applied to geom_bar (arguable behavior IMO). That's why you need to specify stat='identity' when you want to plot the data as is.
The stat='bin' or stat_bin() is a statistical transformation that ggplot does for you. Its output includes the variables surrounded by two dots (..count.. and ..density..). If you don't specify stat='bin' you won't get those variables.
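To see what the stat actually computes, here is a minimal sketch that inspects the binned data behind a histogram. It assumes a reasonably recent ggplot2 (where layer_data() is available) and uses mtcars with an arbitrary binwidth of 5 purely for illustration.

```r
library(ggplot2)

# A histogram of mpg; the binwidth is an arbitrary choice for the example.
p <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(binwidth = 5)

# layer_data() returns the data frame after the stat has run,
# including the computed variables count and density.
binned <- layer_data(p)
print(binned[, c("xmin", "xmax", "count", "density")])

# Width of each bin, used to check that the density integrates to 1.
bin_width <- binned$xmax - binned$xmin
```

The count column is what ends up on the y-axis by default; density is the same information rescaled so that the bar areas sum to 1.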
geom_bar() is for the case where both the x- and y-values are categorical data, so there are gaps between the bars because the x-values are a factor with distinct levels.
geom_histogram() is for one continuous variable and one categorical variable. Usually we put the continuous variable on the x-axis (so the bars touch each other, as the scale is continuous) and the categorical variable on the y-axis.
There is another plot we can use to show the above situation (1 categorical 1 continuous) -- geom_boxplot(). Usually we use y-axis to represent the continuous data as it's going to be a vertical box-and-whisker.
Related
In R, I am using the command plot(Strength, Weight, col= Area) to plot a scatterplot, with Weight as the explanatory numerical variable, and Area as the categorical explanatory variable, and Strength as the response.
There are, say, 6 areas, 1-6, but how can I tell which colour is associated with which area?
The scatterplot is coming out fine, but I can't tell which area the 6 colours on the scatterplot belong to.
You need to add a legend to your plot, see for instance https://www.geeksforgeeks.org/add-legend-to-plot-in-r/
But it will be easier to use the package ggplot2, which makes a legend for you, automatically. Something like, assuming your variables are in data frame yourdata :
library(ggplot2)
ggplot(yourdata, aes(Strength, Weight, color= Area)) +
geom_point()
Learning ggplot2 (gg is "grammar of graphics") will save you time in the long run!
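For completeness, the base-graphics route (adding the legend yourself) could look roughly like the sketch below. The data frame is a made-up stand-in for the asker's data, and the legend position and plotting symbol are arbitrary choices.

```r
# Hypothetical stand-in for the asker's data: six areas, 30 observations.
set.seed(1)
yourdata <- data.frame(
  Strength = runif(30, 10, 50),
  Weight   = runif(30, 1, 5),
  Area     = factor(sample(1:6, 30, replace = TRUE), levels = 1:6)
)

# Colour each point by the integer code of its Area level,
# i.e. by the corresponding entry of the current palette().
plot(yourdata$Strength, yourdata$Weight,
     col = as.integer(yourdata$Area), pch = 19,
     xlab = "Strength", ylab = "Weight")

# One legend entry per level, using the same colour indices as the plot.
area_levels <- levels(yourdata$Area)
legend("topright", legend = area_levels,
       col = seq_along(area_levels), pch = 19, title = "Area")
```

The key point is that the legend colours are generated from the same level order as the point colours, so the two cannot drift apart.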
Context: when you have "many" categories it can become hard to distinguish them in a bar plot. I found the plot below dealing with this situation quite nicely by linking the legend with categories in the plot.
Question: is it possible to do something similar with ggplot2?
With ggplot2 it is straightforward to get this:
But I really do not know where to start to achieve the result shown in the 1st plot.
Here is some code to sort it out:
library(ggplot2)
ggplot(data = mtcars, aes(x = vs, y = disp, fill = factor(carb))) +
geom_bar(stat = "identity")
Expected output (not as nice as the one presented above but it shows the idea)
There is no proper legend on the axes in any of the plots, but my guess is that the desired chart is based on relative frequencies, while your plot seems to show absolute frequencies, though I'm not sure about that.
Assuming that you want to produce a stacked bar chart giving the (relative) number of observations of a categorical variable in two groups, there are two ways to get the two stacked bars to be of the same height:
There needs to be exactly the same number of observations in both of them. Then you can use absolute frequencies.
The absolute frequencies need to be transformed to relative frequencies (or percent) by dividing them by the total number of observations in each group.
You can calculate the relative frequencies yourself and use them as the y-values.
Or refer to this post, as it seems to describe exactly what you want using ggplot2.
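As a rough sketch of both routes, assuming the bars are meant to show counts of carb within each vs group of mtcars (the question's own code sums disp instead, so this is an interpretation): position = "fill" asks ggplot2 to rescale each stack to height 1, while the second plot uses relative frequencies computed by hand.

```r
library(ggplot2)

# Option 1: let ggplot2 rescale every stack to a total height of 1.
p_fill <- ggplot(mtcars, aes(x = factor(vs), fill = factor(carb))) +
  geom_bar(position = "fill") +
  labs(x = "vs", y = "proportion", fill = "carb")

# Option 2: compute the relative frequencies yourself and plot them as given.
counts <- as.data.frame(table(vs = mtcars$vs, carb = mtcars$carb))
counts$prop <- counts$Freq / ave(counts$Freq, counts$vs, FUN = sum)

p_manual <- ggplot(counts, aes(x = vs, y = prop, fill = carb)) +
  geom_col() +
  labs(y = "proportion")
```

Both plots should show two stacks of equal height; the manual version is handy when the proportions are also needed as numbers, for example for labels.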
I honestly don't know why this is being so hard.
I'm creating a simple scatter plot. The x axis is a continuous variable, and at every tick in x I need to plot four points with error bars. I'm using position dodge and everything works fine.
Each point has a different color, size and shape as governed by three further variables: color and shape are governed by factors, size by a continuous variable.
By default, the four points reflect the order of the levels in the color variable (red always left, then green, then blue) but I would like them to reflect the order of the size variable (the continuous one), smallest left and largest right. How do I specify that size should be prioritised when ordering points in position dodge? I tried using reverse ordering but then the points are ordered first according to the shape legend.
I could change the mapping between variable and aesthetics (all variables are fundamentally continuous and could be used with size) but I think it'd be useful to know how to specify the order in which multiple variables should be considered when dodging points.
The question is somewhat unclear unfortunately. You don't show "a simple scatter plot". You are showing some statistics (mean with error band??) for specific x values - although this is seemingly continuous, this looks as if you have categorised it beforehand - resulting in some summary statistics which you are plotting.
Also, it is not easy (impossible) to fully help you without knowing what you have done until now to come to where you are.
I have tried to reproduce a similar looking plot with mtcars.
Dodging is only possible by one group (but one group can contain more than one variable). To specify how to group, add group = ... to your aesthetics.
Like so:
library(tidyverse)
ggplot(filter(mtcars, carb %in% 1:4)) +
geom_point(aes(carb, mpg, size= gear, group = gear, shape = as.character(vs), color = as.factor(cyl)),
position = position_dodge(width = .5))
This is now dodged by gear, which is also used as size aesthetic.
Here is an example of the code I'm working with
x<-as.factor(rep(c("tree_mean","tree_qmean","tree_skew"),3))
factor<-c(rep("mfn2_burned_99",3),rep("mfna_burned_5_7",3),rep("mfna_burned_5_7_10_12",3))
y<-c(0.336457409,-0.347422910,-0.318945621,1.494109367, 0.003578698,-0.019985780,-0.484171146, 0.611589217,-0.322292664)
dat<-data.frame(x,factor,y)  # data.frame() keeps y numeric; cbind() would coerce every column to character
head(dat)
x factor y
tree_mean mfn2_burned_99 -0.3364574
tree_qmean mfn2_burned_99 -0.3474229
tree_skew mfn2_burned_99 -0.3189456
tree_mean mfna_burned_5_7 -0.8269814
tree_qmean mfna_burned_5_7 -0.8088810
tree_skew mfna_burned_5_7 -2.5429226
tree_mean mfna_burned_5_7_10_12 -0.8601206
tree_qmean mfna_burned_5_7_10_12 -0.8474920
tree_skew mfna_burned_5_7_10_12 -2.9854178
I am trying to plot how much x deviates from 0, and facet it by each factor, as so:
ggplot(dat) +
geom_point(aes(x=x,y=y),shape=1,size=3)+
geom_linerange(aes(x=x,ymin=0,ymax=y))+
geom_hline(yintercept=0)+
facet_grid(factor~.)
This works fine when I have three factors (ignore the *: I had a significance column which I have since removed).
Example below:
However, I have 8 factors in total, and faceting obscures the plot such that the distance from zero for each x value gets very distorted.
Example below
So, my question is this: what would be a better way of coding/rendering this plot given my large number of x values and factors using faceting or color coding by factor in ggplot??
I would be very open to color-coding each distance for x by factor rather than faceting, but I have been beating my head against the wall trying to figure out how to even do that in ggplot (very new to ggplot), so I can't yet say if it would make the figure much more interpretable.
One option as you note is to color your point and/or linerange by a factor. You can then use position_dodge to move the points slightly on the x axis.
For example:
ggplot(dat, aes(color = factor)) +
geom_point(aes(x=x,y=y),shape=1,size=3, position = position_dodge(width = 0.5))+
geom_linerange(aes(x=x,ymin=0,ymax=y), position = position_dodge(width =0.5))+
geom_hline(yintercept=0)
I think this would still be difficult with many factors, but with 8 it might suit your purposes.
I am using ggplot2 to make a histogram:
geom_histogram(aes(x=...), y="..ncount../sum(..ncount..)")
and I get the error:
Mapping a variable to y and also using stat="bin".
With stat="bin", it will attempt to set the y value to the count of cases in each group.
This can result in unexpected behavior and will not be allowed in a future version of ggplot2.
If you want y to represent counts of cases, use stat="bin" and don't map a variable to y.
If you want y to represent values in the data, use stat="identity".
See ?geom_bar for examples. (Deprecated; last used in version 0.9.2)
What causes this in general? I am confused about the error because I'm not mapping a variable to y, just histogram-ing x and would like the height of the histogram bar to represent a normalized fraction of the data (such that all the bar heights together sum to 100% of the data.)
edit: if I want to make a density plot geom_density instead of geom_histogram, do I use ..ncount../sum(..ncount..) or ..scaled..? I'm unclear about what ..scaled.. does.
The confusion here is a long-standing one (as evidenced by the verbose warning message), and it all starts with stat_bin.
But users don't typically realize that their confusion revolves around stat_bin, since they typically encounter problems while using either geom_bar or geom_histogram. Note the documentation for each: they both use stat = "bin" (in current ggplot2 versions this stat has been split into stat_bin for continuous data and stat_count for discrete data) by default.
But let's back up. geom_*'s control the actual rendering of data into some sort of geometric form. stat_*'s simply transform your data. The distinction is a bit confusing in practice, because adding a layer of stat_bin will, by default, invoke geom_bar and so it can seem indistinguishable from geom_bar when you're learning.
In any case, consider the "bar"-like geom's: histograms and bar charts. Both are clearly going to involve some binning of data somewhere along the line. But our data could either be pre-summarised or not. For instance, we might want a bar plot from:
x
a
a
a
b
b
b
or equivalently from
x y
a 3
b 3
The first hasn't been binned yet. The second is pre-binned. The default behavior for both geom_bar and geom_histogram is to assume that you have not pre-binned your data. So they will attempt to call stat_bin (for histograms, now stat_count for bar charts) on your x values.
As the warning says, it will then try to map y for you to the resulting counts. If you also attempt to map y yourself to some other variable you end up in Here There Be Dragons territory. Mapping y to functions of the variables returned by stat_bin (..count.., etc.) should be ok and should not throw that warning (it doesn't for me using @mnel's example above).
The take-away here is that for geom_bar if you've pre-computed the heights of the bars, always remember to use stat = "identity", or better yet use the newer geom_col which uses stat = "identity" by default. For geom_histogram it's very unlikely that you will have pre-computed the bins, so in most cases you just need to remember not to map y to anything beyond what's returned from stat_bin.
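A small sketch of the two cases, using the toy a/b data from above. It assumes a ggplot2 version that has geom_col (2.2.0 or later) and in which geom_bar counts rows by default.

```r
library(ggplot2)

# Not yet binned: one row per observation.
raw <- data.frame(x = c("a", "a", "a", "b", "b", "b"))

# Pre-binned: one row per category, with the count already computed.
prebinned <- data.frame(x = c("a", "b"), y = c(3, 3))

# geom_bar() counts the rows for you; y must not be mapped.
p_raw <- ggplot(raw, aes(x = x)) + geom_bar()

# geom_col() takes the heights as given (same as geom_bar(stat = "identity")).
p_pre <- ggplot(prebinned, aes(x = x, y = y)) + geom_col()

# The bar heights that each plot will actually draw.
heights_raw <- layer_data(p_raw)$y
heights_pre <- layer_data(p_pre)$y
```

Both plots draw two bars of height 3; the only difference is who did the counting.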
geom_dotplot uses its own binning stat, stat_bindot, and this discussion applies there as well, I believe. This sort of thing generally hasn't been an issue with the 2d binning cases (geom_bin2d and geom_hex), since the z variable there (the analogue of the binned y variable in the 1d case) has not offered as much flexibility. If future updates start allowing fancier manipulations of the 2d binning cases, this could, I suppose, become something you have to watch out for there.
The documentation for geom_histogram states that it is an alias for stat_bin and geom_bar.
The documentation for geom_density states that it uses a smooth density estimate produced using stat_density.
Following the links (or finding the help pages directly)
stat_bin
The documentation for stat_bin describes how stat_bin returns a data.frame with the following (additional) columns
count: number of points in bin
density: density of points in bin, scaled to integrate to 1
ncount: count, scaled to maximum of 1
ndensity: density, scaled to maximum of 1
stat_density
The documentation for stat_density describes how stat_density returns a data.frame with the following (additional) columns
density: density estimate
count: density * number of points - useful for stacked density plots
scaled: density estimate, scaled to maximum of 1
To put the two on the same scale, it appears that you want either ..density.. from both, or ..ndensity.. from stat_bin and ..scaled.. from stat_density:
ggplot(dd, aes(x=x)) +
geom_histogram(aes(y= ..density..)) +
geom_density(aes(y=..density..))
ggplot(dd, aes(x=x)) +
geom_histogram(aes(y= ..ndensity..)) +
geom_density(aes(y=..scaled..))
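For the original goal (bar heights that sum to 100% of the data), one possible sketch is to divide each bin's count by the total count. This assumes ggplot2 3.3.0 or later, where after_stat() replaces the ..count.. notation; the data frame dd is not defined in the post, so a simulated one is used here, and the number of bins is arbitrary.

```r
library(ggplot2)

# Hypothetical data; `dd` is not defined in the original post.
set.seed(42)
dd <- data.frame(x = rnorm(200))

# Each bar shows the fraction of observations that fall in its bin.
p_prop <- ggplot(dd, aes(x = x)) +
  geom_histogram(aes(y = after_stat(count / sum(count))), bins = 20) +
  ylab("fraction of observations")

# The heights that will be drawn; with a single group they sum to 1.
heights <- layer_data(p_prop)$y
```

Note that this differs from density: density makes the bar areas sum to 1, whereas count / sum(count) makes the bar heights sum to 1.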