Overlay density plot excludes histogram values - r

I want to overlay a density curve on a frequency histogram I have constructed. For the frequency histogram I used aes(y=..count../40) because 40 is my total sample size. I used aes(y=..density..*0.1) to force the density to be somewhere between 0 and 1, since my binwidth is 0.1. However, the density curve doesn't fit my data and it excludes the values that are equal to 1.0 (notice that the histogram shows accumulated values in the bin (1.0,1.1) but the density curve ends at 1.0).
This is my data:
data<-structure(list(variable = structure(c(1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("E1", "test"
), class = "factor"), value = c(0.288888888888889, 0.0817901234567901,
0.219026548672566, 0.584795321637427, 0.927554980595084, 0.44661095636026,
1, 0.653780942692438, 1, 0.806451612903226, 1, 0.276794335371741,
1, 0.930109557990178, 0.776864728192162, 0.824909747292419, 1,
1, 1, 1, 1, 0.0875912408759124, 0.308065494238933, 1, 0.0258064516129032,
0.0167322834645669, 1, 1, 0.355605889014723, 0.310344827586207,
0.106598984771574, 0.364447494852436, 0.174724342663274, 0.77491961414791,
1, 0.856026785714286, 0.680759275237274, 0.850657108721625, 1,
1, 0, 0.851851851851852, 1, 0, 0.294954721862872, 0.819870009285051,
0, 0.734147168531706, 0.0135424091233072, 0.0189098998887653,
0.0101010101010101, 0, 0.296905222437137, 0.706837929731772,
0.269279393173198, 0.135379061371841, 0.158969804618117, 0.0902981940361193,
0.00423131170662906, 0, 0.374880611270296, 0.0425790754257908,
0.145542753183748, 0, 0.129032258064516, 0.260334645669291, 0,
0, 1, 0.175505350772889, 0.08248730964467, 0, 0.317217981340119,
0.614147909967846, 0, 0.264508928571429, 0.883520276100086, 0.0657108721624851,
0, 0.560229445506692)), row.names = c(NA, -80L), .Names = c("variable",
"value"), class = "data.frame")
Plot
q<-ggplot(data, aes(value, fill = variable))
q + geom_density(alpha = 0.6,aes(y=..density..*0.1),binwidth=0.1) +
theme_minimal()+scale_fill_manual(values =c("#D7191C","#2B83BA")) +
theme(legend.position="bottom")+ guides(fill=guide_legend(nrow=1)) +
labs(title="Density Plot GrupoB",x="Respuesta",y="Density") +
scale_x_continuous(breaks=seq(from=0,to=1.2,by=0.1)) +
geom_histogram(alpha = 0.6,aes(y=..count../40),binwidth=0.1,position="dodge")
The output I get is this

Your plot is doing exactly what is to be expected from your data:
You plot data$value, which contains numeric values between 0 and 1, so you should expect the density curve to run from 0 to 1 as well.
You plot a histogram with binwidth 0.1. Bins are closed on the lower and open on the upper end. So the binning you get in your case is [0,0.1), [0.1, 0.2), ..., [0.9,1.0), [1.0,1.1). You have 17 values in your data that are 1 and thus go into the last bin, which is plotted from 1 to 1.1.
I think it's a bad idea to plot the histogram the way you do. The reason is that for a histogram, the x-axis is continuous, meaning that the bar that covers the x-axis range from, say, 0.1 to 0.2 stands for the count of values from 0.1 (inclusive) up to 0.2 (exclusive). Using dodge in this situation leads to a distorted picture, since the bars no longer cover the correct x-axis range: two bars share a range that each of them should cover in full. This distortion is one of the reasons why the density curve seems not to match the histogram.
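The binning described above can be checked outside ggplot2. This is a minimal base R sketch, assuming left-closed bins of width 0.1 as stated in the answer; the vector x is made up for illustration and is not the question's data. The bin-edge defaults of geom_histogram may differ between ggplot2 versions, so treat it only as an illustration of the [a, b) convention.

```r
# Hypothetical values in [0, 1], including two exact 1s (not the question's data)
x <- c(0, 0.05, 0.1, 0.55, 0.95, 1, 1)

# Bin edges 0, 0.1, ..., 1.1 give eleven bins of width 0.1
breaks <- seq(0, 1.1, by = 0.1)

# right = FALSE makes each bin left-closed and right-open: [a, b)
bins <- cut(x, breaks = breaks, right = FALSE)

# Count per bin; the exact 1s end up in the last bin, which starts at 1.0
counts <- table(bins)
print(counts)
```

Because every bin excludes its upper edge, a value of exactly 1 cannot fall into the bin that ends at 1.0, which is why the histogram shows a bar to the right of 1.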
So, what can you do about it? I can give you a few suggestions, but maybe others have better ideas...
Instead of plotting the histograms next to each other with position="dodge", you could use faceting, that is, plot the histograms (and corresponding density curves) into separate plots. This can be achieved by adding + facet_grid(variable~.) to your plot.
You could cheat a little bit to have the last bin, which is [0.9,1), include 1 (i.e. have it be [0.9,1.0]). Simply replace 1 in your data by 0.999 as follows: data$value[data$value==1]<-0.999. It is important that you do this only for the plot, where it really only means that you slightly redefine the binning. For all the numeric evaluations that you intend to do, you should not do this replacement! (It will, e.g., change the mean of data$value.)
Regarding the normalisation of your density curve and the histogram: there is no need for the density curve to lie between 0 and 1. The restriction is that the integral over the density curve should be 1. Thus, to make the density curve and the histogram comparable, the histogram should also have integral 1, which is achieved by also dividing the y-value by the binwidth. So, you should use geom_density(alpha = 0.6,aes(y=..density..)) (I also removed binwidth=0.1 because it has no effect for geom_density) and geom_histogram(alpha = 0.6,aes(y=..count../40/.1),binwidth=0.1) (no need for position="dodge" once you use faceting). This leads, of course, to exactly the relative normalisation that you had, but it makes more sense because the integrals over density curve and histogram are 1, as they should be.
The density curve still does not perfectly match the histogram, and this has to do with the way the density estimator is calculated. I don't know this in detail and thus unfortunately cannot explain it further. But you can get a better understanding of how it works by playing with the adjust parameter of geom_density. Smaller values make the curve less smooth, so that it resembles the histogram more closely.
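The normalisation argument above can be verified numerically. The following is a rough base R sketch under stated assumptions: x is a made-up sample, hist() and density() stand in for what geom_histogram and geom_density compute, and the area under the density curve is approximated by a simple Riemann sum.

```r
# Hypothetical sample in [0, 1); not the question's data
x  <- c(0.02, 0.08, 0.15, 0.22, 0.31, 0.35, 0.48, 0.52, 0.67, 0.91)
bw <- 0.1
breaks <- seq(0, 1, by = bw)

# Left-closed bins, counts only (no plot)
h <- hist(x, breaks = breaks, right = FALSE, plot = FALSE)

# Bar heights as in aes(y = ..count.. / n / binwidth)
heights   <- h$counts / length(x) / bw
hist_area <- sum(heights * bw)            # total area under the bars

# Kernel density estimates; a smaller adjust uses a narrower bandwidth
d_smooth <- density(x)
d_rough  <- density(x, adjust = 0.2)

# Riemann-sum approximation of the area under the smooth curve
dens_area <- sum(d_smooth$y) * diff(d_smooth$x[1:2])

print(c(hist_area = hist_area, dens_area = dens_area))
```

Both areas come out at (approximately) 1, which is what makes the two layers comparable on one y-axis. The narrower bandwidth in d_rough is the base R counterpart of passing a small adjust to geom_density.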
To put everything together, I have built all my suggestions into your code, used adjust=0.2 in geom_density and plotted the result:
data$value[data$value==1]<-0.999
q<-ggplot(data, aes(value, fill = variable))
q + geom_density(alpha = 0.6,aes(y=..density..),adjust=0.2) +
theme_minimal()+scale_fill_manual(values =c("#D7191C","#2B83BA")) +
theme(legend.position="bottom")+ guides(fill=guide_legend(nrow=1)) +
labs(title="Density Plot GrupoB",x="Respuesta",y="Density")+
scale_x_continuous(breaks=seq(from=0,to=1.2,by=0.1))+
geom_histogram(alpha = 0.6,aes(y=..count../40/.1),binwidth=0.1) +
facet_grid(variable~.)
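In ggplot2 3.4.0 and later the ..density.. notation is deprecated in favour of after_stat(). The sketch below shows the same faceted overlay in that style. It is an illustration only: the data frame df is randomly generated as a stand-in for the question's data, and it uses the density variable that stat_bin computes itself, so the group size of 40 does not need to be hard-coded.

```r
library(ggplot2)

# Hypothetical stand-in for the question's data frame (two groups of 40 values)
set.seed(42)
df <- data.frame(
  variable = rep(c("E1", "test"), each = 40),
  value    = c(rbeta(40, 2, 1), rbeta(40, 1, 2))
)

p <- ggplot(df, aes(value, fill = variable)) +
  # stat_bin's computed `density` is count / sum(count) / binwidth per group
  geom_histogram(aes(y = after_stat(density)), binwidth = 0.1, alpha = 0.6) +
  # kernel density estimate on the same scale; small adjust = less smoothing
  geom_density(aes(y = after_stat(density)), adjust = 0.2, alpha = 0.6) +
  # one panel per group instead of dodged bars
  facet_grid(variable ~ .)

# print(p) draws the plot
```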
Unfortunately, I cannot give you a more complete answer, but I hope these ideas give you a good start.

Related

R - Receiving error "geom_line() can't have varying colour, linewidth along the line when linetype isn't solid" when I try to split by colour & lty

I made a dataframe from columns of a csv, I checked the df and nothing is off
position <- (SCATGRAPH$position)
species <- (SCATGRAPH$species)
value <- (SCATGRAPH$value)
habitat <- (SCATGRAPH$habitat)
df <- data.frame(position,species,value,habitat)
I then tried using different strategies found on this site, to create a line graph with four lines, grouped by linetype and colour so that four treatments can be distinguished.
ggplot(data=df, aes(x=habitat, y=value, group=species, colour=species, linetype=position)) + geom_line() + geom_point()
The above makes no graph, but elicits the message:
Error in `geom_line()`:
! Problem while converting geom to grob.
ℹ Error occurred in the 1st layer.
Caused by error in `draw_panel()`:
! `geom_line()` can't have varying colour, linewidth, and/or
alpha along the line when linetype isn't solid
Run `rlang::last_error()` to see where the error occurred.
The rlang::last_error() info is as follows
Backtrace:
1. base (local) `<fn>`(x)
2. ggplot2:::print.ggplot(x)
4. ggplot2:::ggplot_gtable.ggplot_built(data)
5. ggplot2:::by_layer(...)
12. ggplot2 (local) f(l = layers[[i]], d = data[[i]])
13. l$draw_geom(d, layout)
14. ggplot2 (local) draw_geom(..., self = self)
15. self$geom$draw_layer(...)
16. ggplot2 (local) draw_layer(..., self = self)
17. base::lapply(...)
18. ggplot2 (local) FUN(X[[i]], ...)
20. self$draw_panel(data, panel_params, coord, na.rm = FALSE)
21. ggplot2 (local) draw_panel(..., self = self)
When I try grouping by only one factor, either colour or linetype, it works exactly as expected
ggplot(data=df, aes(x=habitat, y=value, group=species, colour=species))+ geom_line()
The equivalent code also works when switching the "colour" out for "linetype", and again works if "species" is switched out for "position" (my other two-category variable - that I would like to associate with linetype).
There seem to be many ways of phrasing this request to R. Using geom_line(aes(linetype=position)) alone or in conjunction with the problem syntax above does not improve results nor does adding a scale_linetype_manual command as mentioned in other posts. Converting "species" or "position" to factors using factor() has not worked either.
EDIT: Here is my data
structure(list(habitat = structure(c(1L, 1L, 2L, 2L, 3L, 3L,
4L, 4L, 1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L), levels = c("Upland interior",
"Upland edge", "Peatland edge", "Peatland interior"), class = "factor"),
position = structure(c(2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L,
1L, 2L, 1L, 2L, 1L, 2L, 1L), levels = c("On", "Off"), class = "factor"),
species = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L), levels = c("Deer", "Moose"), class = "factor"),
value = c(0.428571429, 0.31372549, 0.315789474, 0.315789474,
0.263157895, 0, 0.166666667, 0, 0.285714286, 0.196078431,
0, 0.263157895, 0.052631579, 0.157894737, 0, 0.15)), row.names = c(NA,
-16L), class = "data.frame")
Some of ggplot2's geoms (like geom_line) only have one color, linewidth, alpha, etc. per series of observations that should be grouped together aesthetically. By default, ggplot2 uses some heuristics to figure out which observations should be grouped together. The group aesthetic allows us to specify this explicitly. https://ggplot2.tidyverse.org/reference/aes_group_order.html
In your case, you have series which are distinguished by the combination of their species and their position. To make groups for each combination, we can use group = interaction(species, position). (We could also use paste(species, position) -- anything that gives a different value for each combination.)
(There is a separate but related question that sometimes comes up about how to depict a single series with varying aesthetics as it goes. For that, geom_segment and ggforce::geom_link are two good approaches.)
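A minimal sketch of the group = interaction(species, position) suggestion follows. The data frame is a small made-up example with the same column names as in the question, and the values are invented for illustration.

```r
library(ggplot2)

# Hypothetical data: every combination of habitat, position and species
df <- expand.grid(
  habitat  = factor(c("Upland", "Edge", "Peatland"),
                    levels = c("Upland", "Edge", "Peatland")),
  position = factor(c("On", "Off")),
  species  = factor(c("Deer", "Moose"))
)
df$value <- seq_len(nrow(df)) / nrow(df)   # made-up values

# One group per species/position combination, so colour and linetype
# are constant along each individual line
p <- ggplot(df, aes(x = habitat, y = value,
                    group = interaction(species, position),
                    colour = species, linetype = position)) +
  geom_line() +
  geom_point()

# print(p) draws four lines: two colours by two linetypes
```

With group = species alone, each line would contain both "On" and "Off" rows, so the linetype would have to change along the line, which is what geom_line() rejects.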

ggplot2 boxplots - How to group factors levels on the x-axis (and add reference lines for each group mean)

I have 30 plant species for which I have displayed the distributions of midday leaf water potential (lwp_md) using boxplots and the package ggplot2. But how do I group these species along the x-axis according to their leaf habits (e.g. Deciduous, Evergreen) as well as display a reference line indicating the mean lwp_md value for each leaf habit level?
I have attempted with the package forcats but really have no idea how to proceed with this one. I can't find anything after an extensive search online. The best I seem able to do is order species by some other function e.g. the median.
Below is an example of my code so far. Note I have used the packages ggplot2 and ggthemes:
library(ggplot2)
library(ggthemes)  # theme_few()
library(forcats)   # fct_reorder()
ggplot(zz, aes(x=fct_reorder(species, lwp_md, .fun=median, .desc=TRUE), y=lwp_md)) +
geom_boxplot(aes(fill=leaf_habit)) +
theme_few(base_size=14) +
theme(legend.position="top",
axis.text.x=element_text(size=8, angle=45, vjust=1, hjust =1)) +
xlab("Species") +
ylab("Maximum leaf water potential (MPa)") +
scale_y_reverse() +
scale_fill_discrete(name="Leaf habit",
breaks=c("DEC", "EG"),
labels=c("Deciduous", "Evergreen"))
Here's a subset of my data including 4 of my species (2 deciduous, 2 evergreen):
> dput(zz)
structure(list(id = 1:20, species = structure(c(1L, 1L, 1L, 1L,
1L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L
), .Label = c("AMYELE", "BURSIM", "CASXYL", "COLARB"), class = "factor"),
leaf_habit = structure(c(2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L,
1L, 1L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L), .Label = c("DEC",
"EG"), class = "factor"), lwp_md = c(-2.1, -2.5, -2.35, -2.6,
-2.45, -1.7, -1.55, -1.4, -1.55, -0.6, -2.6, -3.6, -2.9,
-3.1, -3.3, -2, -1.8, -2, -4.9, -5.35)), class = "data.frame", row.names = c(NA,
-20L))
An example of how I'm looking to display my data, cut and edited - I would like species on x-axis, lwp_md on y-axis:
ggplot2 orders a factor along the axis by its levels, which are alphabetical by default. To avoid this you have to set the level order yourself. This can be done by arranging the data.frame and then redeclaring the factor with its levels in the order in which they now appear. To generate the mean value we can use group_by and mutate to add a mean column to the data frame, which can later be plotted.
Here is the complete code:
library(ggplot2)
library(ggthemes)
library(dplyr)
zz2 <- zz %>% arrange(leaf_habit) %>% group_by(leaf_habit) %>% mutate(mean=mean(lwp_md))
zz2$species <- factor(zz2$species,levels=unique(zz2$species))
ggplot(zz2, aes(x=species, y=lwp_md)) +
geom_boxplot(aes(fill=leaf_habit)) +
theme_few(base_size=14) +
theme(legend.position="top",
axis.text.x=element_text(size=8, angle=45, vjust=1, hjust =1)) +
xlab("Species") +
ylab("Maximum leaf water potential (MPa)") +
scale_y_reverse() +
scale_fill_discrete(name="Leaf habit",
breaks=c("DEC", "EG"),
labels=c("Deciduous", "Evergreen")) +
geom_errorbar(aes(species, ymax = mean, ymin = mean),
size=0.5, linetype = "longdash", inherit.aes = F, width = 1)
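The two preparation steps in the code above (fixing the species order and adding a per-habit mean) can be seen in isolation in the following base R sketch. It uses order() and ave() in place of the dplyr verbs, and the four-row data frame is made up, with only the column names taken from the question.

```r
# Hypothetical one-row-per-species subset with the question's column names
zz <- data.frame(
  species    = factor(c("AMYELE", "BURSIM", "CASXYL", "COLARB")),
  leaf_habit = factor(c("EG", "DEC", "EG", "DEC")),
  lwp_md     = c(-2.4, -1.4, -3.1, -3.2)
)

# Sort rows by leaf habit, then fix the species levels to that row order
zz2 <- zz[order(zz$leaf_habit), ]
zz2$species <- factor(zz2$species, levels = unique(as.character(zz2$species)))

# Mean of lwp_md within each leaf habit, repeated on every row of that habit
zz2$mean <- ave(zz2$lwp_md, zz2$leaf_habit, FUN = mean)

print(levels(zz2$species))
print(zz2)
```

After this step the deciduous species come first on the x-axis, followed by the evergreen ones, and the mean column holds the value that each group's reference line is drawn at.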

Mosaic plot: Discrete value supplied to continuous scale

I am working on a project about the relationship between voters' perceptions and their voting behavior.
The x variable, perception (PQ8_W3), is a categorical variable with 4 values:
(1) winner
(2) loser
(3) can't say
(9) don't know
whereas the y variable (same) is a 1/0 dummy.
I know x is a factor and I tried to change it to numeric, but I still cannot generate the mosaic graph. I also recoded 9 as 4 to make it continuous, but that still doesn't work.
Here is the dput data
c(NA, NA, NA, NA, 1, NA, NA, 1, 1, 0, 0, 1, 1, 1, 1)
structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 3L, 1L, 1L, 1L, 2L, 1L,
1L, 1L), .Label = c("1", "2", "3", "9"), class = "factor")
This is my code
library(ggplot2)
library(ggmosaic)
# here is fine, but I want to change the x and y variables
ggplot(data = AfD) + geom_mosaic(aes(x = product(same), fill=PQ8_W3), na.rm=TRUE)
# the error message comes out here, the console just gives me "Discrete value supplied to continuous scale"
ggplot(data = AfD) + geom_mosaic(aes(x = product(PQ8_W3), fill=(same), na.rm=TRUE))
Hope someone can help.

How to plot errorbars on this plot and change the overlay?

I have this dataset:
tdat=structure(list(Condition = structure(c(1L, 3L, 2L, 1L, 3L, 2L,
1L, 3L, 2L, 1L, 3L, 2L, 1L, 3L, 2L, 1L, 3L, 2L, 1L, 3L, 2L, 1L,
3L, 2L, 1L, 3L, 2L), .Label = c("AS", "Dup", "MCH"), class = "factor"),
variable = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 3L), .Label = c("Bot", "Top", "All"), class = "factor"),
value = c(1.782726022, 1, 2.267946449, 1.095240234, 1, 1.103630141,
1.392545278, 1, 0.854984833, 4.5163067, 1, 4.649271897, 0.769428018,
1, 0.483117123, 0.363854608, 1, 0.195799358, 0.673186975,
1, 1.661568993, 1.174998373, 1, 1.095026419, 1.278455823,
1, 0.634152231)), .Names = c("Condition", "variable", "value"
), row.names = c(NA, -27L), class = "data.frame")
> head(tdat)
  Condition variable    value
1        AS      Bot 1.782726
2       MCH      Bot 1.000000
3       Dup      Bot 2.267946
4        AS      Bot 1.095240
5       MCH      Bot 1.000000
6       Dup      Bot 1.103630
I can plot it like this using this code:
ggplot(tdat, aes(x=interaction(Condition,variable,drop=TRUE,sep='-'), y=value,
fill=Condition)) +
geom_point() +
scale_color_discrete(name='interaction levels')+
stat_summary(fun.y='mean', geom='bar',
aes(label=signif(..y..,4),x=as.integer(interaction(Condition,variable))))
I have 2 questions:
How to change the overlay so the black points are not hidden by the bar chart (3 points should be visible per column)?
How to add vertical error bars on top of the bars using the standard deviation of the black points?
I'm not much in favor of mixing error bars with a bar plot.
In ggplot2 geoms are drawn in the order you add them to the plot. So, in order to have the points not hidden, add them after the bars.
# mean_sdl() needs the Hmisc package; newer ggplot2 passes its arguments via fun.args
ggplot(tdat, aes(x=interaction(Condition,variable,drop=TRUE,sep='-'), y=value,
fill=Condition)) +
stat_summary(fun.data="mean_sdl", fun.args=list(mult=1), geom="errorbar") +
stat_summary(fun='mean', geom='bar') +
geom_point(show.legend=FALSE) +
scale_fill_discrete(name='interaction levels')
Like this:
library(plyr)  # ddply() and .()
tdat$x <- with(tdat,interaction(Condition,variable,drop=TRUE,sep='-'))
# one row per bar with mean - sd and mean + sd
tdat_err <- ddply(tdat,.(x),
summarise,ymin = mean(value) - sd(value),
ymax = mean(value) + sd(value))
ggplot(tdat, aes(x=x, y=value)) +
stat_summary(fun='mean', geom='bar',
aes(label=signif(..y..,4),fill=Condition)) +
geom_point() +
geom_errorbar(data = tdat_err,aes(x = x,ymin = ymin,ymax = ymax,y = NULL),width = 0.5) +
labs(fill = 'Interaction Levels')
I've cleaned up your code somewhat. You will run into fewer problems if you move any extraneous computations outside of your ggplot() call. Better to create the new x variable first. Everything is more readable that way too.
The overlaying issue just requires re-ordering the layers.
Note that you were using scale_colour_* when you had mapped fill not colour (this is a very common error).
The only other "trick" was the un-mapping of y. Normally, when things get tricky I omit aes from the top level ggplot call entirely to make sure that each layer gets only the aesthetics that it needs.
For the error bars, again, I tend to create the data frame outside of ggplot first. I find that cleaner and easier to read.
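The error-bar data frame can also be built without plyr. This is a small base R sketch of the same idea (mean minus and plus one standard deviation per bar), using tapply(); the data frame is invented and only mirrors the long format of tdat.

```r
# Hypothetical data in the same long format as tdat (3 points per bar)
tdat <- data.frame(
  Condition = rep(c("AS", "MCH", "Dup"), times = 6),
  variable  = rep(c("Bot", "Top"), each = 9),
  value     = c(1.8, 1, 2.3, 1.1, 1, 1.1, 1.4, 1, 0.9,
                4.5, 1, 4.6, 0.8, 1, 0.5, 0.4, 1, 0.2)
)

# One x category per Condition/variable combination, e.g. "AS-Bot"
tdat$x <- interaction(tdat$Condition, tdat$variable, drop = TRUE, sep = "-")

# Mean and standard deviation of the points behind each bar
m <- tapply(tdat$value, tdat$x, mean)
s <- tapply(tdat$value, tdat$x, sd)

# Data frame with the columns geom_errorbar() expects
tdat_err <- data.frame(x    = names(m),
                       ymin = as.numeric(m - s),
                       ymax = as.numeric(m + s))
print(tdat_err)
```

The result has one row per bar and can be passed to geom_errorbar(data = tdat_err, ...) in the same way as the ddply() version.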

move axis labels ggplot

I have produced a facet graph in ggplot2 and the x-axis title (bottom) is touching the scale values slightly (it's worse when I plot to a .pdf device). How do I move the axis title down a smidge?
DF<-structure(list(race = structure(c(3L, 1L, 3L, 2L, 3L, 1L, 2L,
2L, 2L, 3L, 2L, 1L, 3L, 3L, 3L, 3L, 2L, 1L, 2L, 3L), .Label = c("asian",
"black", "white"), class = "factor"), gender = structure(c(1L,
1L, 1L, 2L, 1L, 2L, 1L, 1L, 1L, 2L, 2L, 1L, 2L, 2L, 2L, 1L, 1L,
2L, 2L, 2L), .Label = c("female", "male"), class = "factor"),
score = c(0.0360497844302483, 0.149771418578119, 0.703017688328021,
1.32540102136392, 0.627084455719946, -0.320051801571444,
0.852281028633536, -0.440056896755573, 0.621765489966213,
0.58981396944136, 1.95257757882381, 0.127301498272644, -0.0906338578670778,
-0.637727808028146, -0.449607617033673, 1.03162398117388,
0.334259623567608, 0.0912327543652576, -0.0789977852804991,
0.511696466039959), time1 = c(75.9849658266583, 38.7148843859919,
54.3512613852158, 37.3210772390582, 83.8061071736856, 14.3853324033061,
79.2285735003004, 31.1324602891428, 22.2294730114138, 26.427263191766,
40.5529893144888, 19.2463281412667, 8.45085646487301, 97.6770352620696,
61.1874163107771, 31.3727683430548, 99.4155144857594, 79.0996849438957,
21.2504885323517, 94.1079332400361)), .Names = c("race",
"gender", "score", "time1"), class = "data.frame", row.names = c(NA,
-20L))
require(ggplot2)
p <- ggplot(DF, aes(score, time1, group=gender))
p + geom_point(shape=19) + facet_grid(race~gender) + scale_x_continuous('BLAH BLAH') +
scale_y_continuous('Some MOre Of theat Good Blahing')
In my data BLAH BLAH is touching the numbers. I need it to move down. How?
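You can adjust the positioning of the x-axis title using:
+ opts(axis.title.x = theme_text(vjust=-0.5))
Play around with the -0.5 "vertical justification" parameter until it suits you/your display device. (opts() and theme_text() belong to early ggplot2 releases; later releases replaced them with theme() and element_text().)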
This is an easy workaround, based on the answer provided here
Just add a line break, \n, at the start of your axis title: xlab("\nYour_x_Label") (or at the end if you need to move your y label).
It doesn't offer as much control as Eduardo's suggestion in the comments, theme(axis.title.x = element_text(vjust=-0.5)), or the use of margin, but it is much simpler!
I would like to note that this is not my answer but @JWilliman's; it appears in the comments on @Prasad Chalasani's answer. I am writing this because the current upvoted answers did not actually work well for me, but @JWilliman's solution does:
#Answer
+ theme(axis.title.x = element_text(margin = margin(t = 20)))
This is because theme(axis.title.x = element_text(vjust = 0.5)) has been superseded and now moves the title/label a fixed distance regardless of the value you put in.
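For completeness, here is a self-contained sketch of the margin approach applied to a faceted plot like the one in the question. The data frame is a small made-up stand-in with the question's column names, and t = 20 (points, the default unit of margin()) is just a starting value to tune.

```r
library(ggplot2)

# Hypothetical data; column names follow the question
DF <- data.frame(
  score  = c(-0.4, 0.1, 0.6, 1.3, 0.0, 0.9),
  time1  = c(31, 39, 84, 37, 54, 79),
  gender = rep(c("female", "male"), times = 3),
  race   = rep(c("asian", "black", "white"), each = 2)
)

p <- ggplot(DF, aes(score, time1)) +
  geom_point(shape = 19) +
  facet_grid(race ~ gender) +
  labs(x = "BLAH BLAH", y = "Some more of that good blahing") +
  # add 20 pt of space above the x-axis title, pushing it away from the tick labels
  theme(axis.title.x = element_text(margin = margin(t = 20)))

# print(p) draws the plot; for the y-axis title use axis.title.y with margin(r = ...)
```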

Resources