ggplot2 histogram of factors showing the probability mass instead of count - r
I am trying to use the excellent ggplot2 using the bar geom to plot the probability mass rather than the count. However, using aes(y=..density..) the distribution does not sum to one (but is close). I think the problem might be due to the default binwidth for factors. Here is an example of the problem,
age <- c(rep(0,4), rep(1,4))
mppf <- c(1,1,1,0,1,1,0,0)
data.test <- as.data.frame(cbind(age,mppf))
data.test$age <- as.factor(data.test$age)
data.test$mppf <- as.factor(data.test$mppf)
p.test.density <- ggplot(data.test, aes(mppf, group=age, fill=age)) +
geom_bar(aes(y=..density..), position='dodge') +
scale_y_continuous(limits=c(0,1))
dev.new()
print(p.test.density)
I can get around this problem by keeping the x-variable as continuous and setting binwidth=1, but it doesn't seem very elegant.
data.test$mppf.numeric <- as.numeric(data.test$mppf)
p.test.density.numeric <- ggplot(data.test, aes(mppf.numeric, group=age, fill=age)) +
geom_histogram(aes(y=..density..), position='dodge', binwidth=1)+
scale_y_continuous(limits=c(0,1))
dev.new()
print(p.test.density.numeric)
I think you almost have it figured out, and would have once you realized you needed a bar plot and not a histogram.
The default width for bars with categorical data is .9 (See ?stat_bin. The help page for geom_bar doesn't give the default bar width but does send you to stat_bin for further reading.). Given that, your plots show the correct density for a bar width of .9. Simply change to a width of 1 and you will see the density values you expected to see.
ggplot(data.test, aes(x = mppf, group = age, fill = age)) +
geom_bar(aes(y=..density..), position = "dodge", width = 1) +
scale_y_continuous(limits=c(0,1))
Related
ggplot: why is the y-scale larger than the actual values for each response?
Likely a dumb question, but I cannot seem to find a solution: I am trying to graph a categorical variable on the x-axis (3 groups) and a continuous variable (% of 0 - 100) on the y-axis. When I do so, I have to clarify that the geom_bar is stat = "identity" or use the geom_col. However, the values still show up at 4000 on the y-axis, even after following the comments from Y-scale issue in ggplot and from Why is the value of y bar larger than the actual range of y in stacked bar plot?. Here is how the graph keeps coming out: I also double checked that the x variable is a factor and the y variable is numeric. Why would this still be coming out at 4000 instead of 100, like a percentage? EDIT: The y-values are simply responses from participants. I have a large dataset (N = 600) and the y-value are a percentage from 0-100 given by each participant. So, in each group (N = 200 per group), I have a value for the percentage. I wanted to visually compare the three groups based on the percentages they gave. This is the code I used to plot the graph. df$group <- as.factor(df$group) df$confid<- as.numeric(df$confid) library(ggplot2) plot <-ggplot(df, aes(group, confid))+ geom_col()+ ylab("confid %") + xlab("group")
Are you perhaps trying to plot the mean percentage in each group? Otherwise, it is not clear how a bar plot could easily represent what you are looking for. You could perhaps add error bars to give an idea of the spread of responses. Suppose your data looks like this: set.seed(4) df <- data.frame(group = factor(rep(1:3, each = 200)), confid = sample(40, 600, TRUE)) Using your plotting code, we get very similar results to yours: library(ggplot2) plot <-ggplot(df, aes(group, confid))+ geom_col()+ ylab("confid %") + xlab("group") plot However, if we use stat_summary, we can instead plot the mean and standard error for each group: ggplot(df, aes(group, confid)) + stat_summary(geom = "bar", fun = mean, width = 0.6, fill = "deepskyblue", color = "gray50") + geom_errorbar(stat = "summary", width = 0.5) + geom_point(stat = "summary") + ylab("confid %") + xlab("group")
Ggplot2 Boxplot width setting changes x-axis
I have produced a boxplot with a continuous x-axis unsing geom_boxplot() in ggplot2. However, as there are many boxes they appear as skinny lines. Another stackoverflow chain (see here) suggested using the width= argument to make all the boxes the same width. However, when I use this argument it changes the x-axis and some of the boxes just disappear! For example, take this example dataframe. I apologise for the number of observations this has but I think the quantity has to do with the problem as I couldn't reproduce it with a more simple boxplot: Lat<- c(50.70228,50.70228,50.70228,51.82067,51.82067,51.82067,52.45893,52.45893,52.45893,52.76478,52.76478,52.76478,52.78354,52.78354,52.78354,53.56102,53.56102,53.56102,53.65364,53.65364,53.65364,53.63130,53.63130,53.63130,54.19035,54.19035,54.19035,54.25751,54.25751,54.25751,54.23526,54.23526,54.23526,54.62469,54.62469,54.62469,54.67831,54.67831,54.67831,54.67900,54.67900,54.67900,54.94908,54.94908,54.94908,55.19456,55.19456,55.19456,54.79198,54.79198,54.79198,55.34981,55.34981,55.34981,55.85655,55.85655,55.85655,56.06078,56.06078,56.06078,55.84553,55.84553,55.84553,56.00197,56.00197,56.00197,56.71842,56.71842,56.71842,57.00116,57.00116,57.00116,57.06942,57.06942,57.06942,57.26815,57.26815,57.26815,57.45532,57.45532,57.45532,57.88596,57.88596,57.88596,51.07711,51.07711,51.07711,51.07801,51.07621,51.11159,51.11159,51.11159,52.02484,52.02484,52.02484,52.02581,52.02581,52.02581,52.02685,52.02685,52.02685,52.05353,52.05353,52.05626,52.05353,52.05353,52.05353,52.05353,52.05353,52.05353,51.93541,51.93541,51.93541,51.93541,51.93541,51.93541,51.93541,51.93541,52.92425,52.92425,52.92425,52.92425,52.92425,52.92425,52.92425,52.92425,52.92425,52.92425,52.92425,52.92425,52.92425,52.92425,52.90810,52.90810,52.90810,52.90810,52.90810,52.90810,52.78968,52.78778,52.78968,52.78968,52.78881,52.78883,52.78883,52.78883,52.78970,52.78970,52.79506,52.79506,52.79506,53.77270,53.77276,53.77109,53.77109,53.77276,53.76845,53.76845,53.77109,53.76845,53.77109,53.87020,53.87020,53.87020,53.87103,53.88205,53.88205,53.88205,53.88205,53.87701,53.87701,53.87098,53.87098,53.87098,53.86932,53.86932,53.86932,56.51869,56.51869,56.51869,56.55870,56.55870,56.55870,56.55964,56.55964,56.55964,57.51056,57.49542,57.49542,57.50878,57.50878,57.50878,57.45201,57.45477,57.45192,57.45192,57.45192) y <- c(33.45407,21.40954,27.73487,20.38318,26.65483,31.68201,23.95467,20.77363,32.94192,22.71228,25.78824,28.39449,35.60615,24.29325,22.95047,25.65343,30.23262,22.05534,37.20565,35.53812,38.20211,39.38034,35.16619,38.82336,29.72370,38.25754,26.51339,39.38283,29.57483,31.80111,24.52967,34.83037,21.75038,35.50868,39.41830,21.96971,22.82504,32.69746,35.10747,27.75669,34.96690,37.61921,37.17226,20.50448,39.26582,22.08668,28.41502,36.69530,23.69404,23.18052,33.27420,23.04157,33.17285,32.00579,21.83845,22.97143,32.27190,21.53771,38.65481,20.14341,33.62718,39.86755,39.77881,30.59810,27.65909,24.11646,34.56981,29.30249,34.99361,32.39553,28.90443,34.88775,22.77049,36.44468,30.64496,35.81501,31.77673,24.19058,39.36298,21.47219,23.02268,31.37647,27.28457,33.14749,23.20842,39.73427,39.81399,35.51515,24.55080,39.41190,29.59987,38.46791,20.94479,37.22109,26.36060,30.91641,39.25975,39.88288,22.59061,30.24439,21.66110,30.36878,28.76901,38.75561,33.80408,31.05842,26.18921,21.30804,35.02966,33.85981,30.84373,31.67341,35.07605,37.93820,31.30481,21.45117,37.13626,25.70964,25.64736,38.58381,31.24448,26.55902,23.90817,33.70300,26.48909,37.73200,32.52413,22.44440,28.19878,32.46415,25.13711,26.66075,28.16254,20.40673,39.89327,30.83327,32.40196,39.81218,39.80391,21.87316,34.95792,33.38958,38.18441,22.03114,35.64410,34.90643,24.23056,36.66581,29.35813,20.86880,30.02044,36.13727,24.65558,39.43175,29.00154,29.78185,22.89196,37.15204,35.88188,28.73920,28.04934,37.50701,30.36306,28.39842,35.20973,26.54260,29.57763,26.03163,26.90440,27.60110,25.80086,39.98019,21.59970,28.83825,32.01711,20.50812,38.43331,32.41898,27.68722,32.59905,24.18150,29.05701,22.38512,32.93342,37.66694,37.65391,34.19613,23.89985,36.90012,20.74244,27.08511,29.21433,35.83771,35.59557,33.74533,27.08854,38.38994) V3 <-c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2) df <- as.data.frame(cbind(Lat, y, as.factor(V3))) head(df) I plot it on a continuous x-axis as so: df_plot <- ggplot(df, aes(x=Lat, y=y, group=Lat))+ geom_boxplot(aes(colour=as.factor(V3)))+ theme_classic() df_plot Which produces: As you can see the boxes are represented as skinny lines. Therefore I tried to use the width= argument as so: df_plot2 <- ggplot(df, aes(x=Lat, y=y, group=Lat))+ geom_boxplot(aes(colour=as.factor(V3)), width=1)+ theme_classic() df_plot2 The output is: The main thing to notice here is that the x-axis range has suddenly changed! Some of the boxes are no longer plotted whilst others seem to be placed at different values of the x-axis. The range of the x-axis should be: range(df$Lat) [1] 50.70228 57.88596 I am completley perplexed as to why the x-axis would change by simply adding the width= argument in geom_boxplot(). I therefore tried to force the limits of the x-axis scale as so: df_plot3 <- ggplot(df, aes(x=Lat, y=y, group=Lat))+ geom_boxplot(aes(colour=as.factor(V3)), width=1)+ xlim(50,58)+ theme_classic() df_plot3 ouput: Please send help!
I think the strange behaviour comes from ggplot trying to automatically dodge your boxplots apart. By setting position = position_dodge(width = 0) the plot seems to be created as expected without changing the placement of boxes along the x-axis. (But gives a warning about overlapping x intervals) Lat<- c(50.70228,50.70228,50.70228,51.82067,51.82067,51.82067,52.45893,52.45893,52.45893,52.76478,52.76478,52.76478,52.78354,52.78354,52.78354,53.56102,53.56102,53.56102,53.65364,53.65364,53.65364,53.63130,53.63130,53.63130,54.19035,54.19035,54.19035,54.25751,54.25751,54.25751,54.23526,54.23526,54.23526,54.62469,54.62469,54.62469,54.67831,54.67831,54.67831,54.67900,54.67900,54.67900,54.94908,54.94908,54.94908,55.19456,55.19456,55.19456,54.79198,54.79198,54.79198,55.34981,55.34981,55.34981,55.85655,55.85655,55.85655,56.06078,56.06078,56.06078,55.84553,55.84553,55.84553,56.00197,56.00197,56.00197,56.71842,56.71842,56.71842,57.00116,57.00116,57.00116,57.06942,57.06942,57.06942,57.26815,57.26815,57.26815,57.45532,57.45532,57.45532,57.88596,57.88596,57.88596,51.07711,51.07711,51.07711,51.07801,51.07621,51.11159,51.11159,51.11159,52.02484,52.02484,52.02484,52.02581,52.02581,52.02581,52.02685,52.02685,52.02685,52.05353,52.05353,52.05626,52.05353,52.05353,52.05353,52.05353,52.05353,52.05353,51.93541,51.93541,51.93541,51.93541,51.93541,51.93541,51.93541,51.93541,52.92425,52.92425,52.92425,52.92425,52.92425,52.92425,52.92425,52.92425,52.92425,52.92425,52.92425,52.92425,52.92425,52.92425,52.90810,52.90810,52.90810,52.90810,52.90810,52.90810,52.78968,52.78778,52.78968,52.78968,52.78881,52.78883,52.78883,52.78883,52.78970,52.78970,52.79506,52.79506,52.79506,53.77270,53.77276,53.77109,53.77109,53.77276,53.76845,53.76845,53.77109,53.76845,53.77109,53.87020,53.87020,53.87020,53.87103,53.88205,53.88205,53.88205,53.88205,53.87701,53.87701,53.87098,53.87098,53.87098,53.86932,53.86932,53.86932,56.51869,56.51869,56.51869,56.55870,56.55870,56.55870,56.55964,56.55964,56.55964,57.51056,57.49542,57.49542,57.50878,57.50878,57.50878,57.45201,57.45477,57.45192,57.45192,57.45192) y <- c(33.45407,21.40954,27.73487,20.38318,26.65483,31.68201,23.95467,20.77363,32.94192,22.71228,25.78824,28.39449,35.60615,24.29325,22.95047,25.65343,30.23262,22.05534,37.20565,35.53812,38.20211,39.38034,35.16619,38.82336,29.72370,38.25754,26.51339,39.38283,29.57483,31.80111,24.52967,34.83037,21.75038,35.50868,39.41830,21.96971,22.82504,32.69746,35.10747,27.75669,34.96690,37.61921,37.17226,20.50448,39.26582,22.08668,28.41502,36.69530,23.69404,23.18052,33.27420,23.04157,33.17285,32.00579,21.83845,22.97143,32.27190,21.53771,38.65481,20.14341,33.62718,39.86755,39.77881,30.59810,27.65909,24.11646,34.56981,29.30249,34.99361,32.39553,28.90443,34.88775,22.77049,36.44468,30.64496,35.81501,31.77673,24.19058,39.36298,21.47219,23.02268,31.37647,27.28457,33.14749,23.20842,39.73427,39.81399,35.51515,24.55080,39.41190,29.59987,38.46791,20.94479,37.22109,26.36060,30.91641,39.25975,39.88288,22.59061,30.24439,21.66110,30.36878,28.76901,38.75561,33.80408,31.05842,26.18921,21.30804,35.02966,33.85981,30.84373,31.67341,35.07605,37.93820,31.30481,21.45117,37.13626,25.70964,25.64736,38.58381,31.24448,26.55902,23.90817,33.70300,26.48909,37.73200,32.52413,22.44440,28.19878,32.46415,25.13711,26.66075,28.16254,20.40673,39.89327,30.83327,32.40196,39.81218,39.80391,21.87316,34.95792,33.38958,38.18441,22.03114,35.64410,34.90643,24.23056,36.66581,29.35813,20.86880,30.02044,36.13727,24.65558,39.43175,29.00154,29.78185,22.89196,37.15204,35.88188,28.73920,28.04934,37.50701,30.36306,28.39842,35.20973,26.54260,29.57763,26.03163,26.90440,27.60110,25.80086,39.98019,21.59970,28.83825,32.01711,20.50812,38.43331,32.41898,27.68722,32.59905,24.18150,29.05701,22.38512,32.93342,37.66694,37.65391,34.19613,23.89985,36.90012,20.74244,27.08511,29.21433,35.83771,35.59557,33.74533,27.08854,38.38994) V3 <-c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2) library(ggplot2) df <- as.data.frame(cbind(Lat, y, as.factor(V3))) df_plot <- ggplot(df) + geom_boxplot(aes(colour=as.factor(V3), x=Lat, y=y, group=as.factor(Lat)), position=position_dodge(width = 0), width=1) + theme_classic()
ggplot2 density histogram with width=.5, vline and centered bar positions
I want a nice density (that sums to 1) histogram for some discrete data. I have tried a couple of ways to do this, but none were entirely satisfactory. Generate some data: #data set.seed(-999) d.test = data.frame(score = round(rnorm(100,1))) mean.score = mean(d.test[,1]) d1 = as.data.frame(prop.table(table(d.test))) The first gives the right placement of bars -- centered on top of the number -- but the wrong placement of vline(). This is because the x-axis is discrete (factor) and so the mean is plotted using the number of levels, not the values. The mean value is .89. ggplot(data=d1, aes(x=d.test, y=Freq)) + geom_bar(stat="identity", width=.5) + geom_vline(xintercept=mean.score, color="blue", linetype="dashed") The second gives the correct vline() placement (because the x-axis is continuous), but wrong placement of bars and the width parameter does not appear to be modifiable when x-axis is continuous (see here). I also tried the size parameter which also has no effect. Ditto for hjust. ggplot(d.test, aes(x=score)) + geom_histogram(aes(y=..count../sum(..count..)), width=.5) + geom_vline(xintercept=mean.score, color="blue", linetype="dashed") Any ideas? My bad idea is to rescale the mean so that it fits with the factor levels and use the first solution. This won't work well in case some of the factor levels are 'missing', e.g. 1, 2, 4 with no factor for 3 because no datapoint had that value. If the mean is 3.5, rescaling this is odd (x-axis is no longer an interval scale). Another idea is this: ggplot(d.test, aes(x=score)) + stat_bin(binwidth=.5, aes(y= ..density../sum(..density..)), hjust=-.5) + scale_x_continuous(breaks = -2:5) + #add ticks back geom_vline(xintercept=mean.score, color="blue", linetype="dashed") But this requires adjusting the breaks, and the bars are still in the wrong positions (not centered). Unfortunately, hjust does not appear to work. How do I get everything I want? density sums to 1 bars centered above values vline() at the correct number width=.5 With base graphics, one could perhaps solve this problem by plotting twice on the x-axis. Is there some similar way here?
It sounds like you just want to make sure that your x-axis values are numeric rather than factors ggplot(data=d1, aes(x=as.numeric(as.character(d.test)), y=Freq)) + geom_bar(stat="identity", width=.5) + geom_vline(xintercept=mean.score, color="blue", linetype="dashed") + scale_x_continuous(breaks=-2:3) which gives
Splitting distribution visualisations on the y-axis in ggplot2 in r
The most commonly cited example of how to visualize a logistic fit using ggplot2 seems to be something very much like this: data("kyphosis", package="rpart") ggplot(data=kyphosis, aes(x=Age, y = as.numeric(Kyphosis) - 1)) + geom_point() + stat_smooth(method="glm", family="binomial") This visualisation works great if you don't have too much overlapping data, and the first suggestion for crowded data seems to be to use injected jitter in the x and y coordinates of the points then adjust the alpha value of the points. When you get to the point where individual points aren't useful but distributions of points are, is it possible to use geom_density(), geom_histogram(), or something else to visualise the data but continue to split the categorical variable along the y-axis as it is done with geom_point()? From what I have found, geom_density() and geom_histogram() can easily be split/grouped by the categorical variable and both levels can easily be reversed using scale_y_reverse() but I can't figure out if it is even possible to move only one of the categorical variable distributions to the top of the plot. Any help/suggestions would be appreciated.
The annotate() function in ggplot allows you to add geoms to a plot with properties that "are not mapped from the variables of a data frame, but are instead in as vectors," meaning that you can add layers that are unrelated to your data frame. In this case your two density curves are related to the data frame (since the variables are in it), but because you're trying to position them differently, using annotate() is useful. Here's one way to go about it: data("kyphosis", package="rpart") model.only <- ggplot(data=kyphosis, aes(x=Age, y = as.numeric(Kyphosis) - 1)) + stat_smooth(method="glm", family="binomial") absents <- subset(kyphosis, Kyphosis=="absent") presents <- subset(kyphosis, Kyphosis=="present") dens.absents <- density(absents$Age) dens.presents <- density(presents$Age) scaling.factor <- 10 # Make the density plots taller model.only + annotate("line", x=dens.absents$x, y=dens.absents$y*scaling.factor) + annotate("line", x=dens.presents$x, y=dens.presents$y*scaling.factor + 1) This adds two annotated layers with scaled density plots for each of the kyphosis groups. For the presents variable, y is scaled and increased by 1 to shift it up. You can also fill the density plots instead of just using a line. Instead of annotate("line"...) you need to use annotate("polygon"...), like so: model.only + annotate("polygon", x=dens.absents$x, y=dens.absents$y*scaling.factor, fill="red", colour="black", alpha=0.4) + annotate("polygon", x=dens.presents$x, y=dens.presents$y*scaling.factor + 1, fill="green", colour="black", alpha=0.4) Technically you could use annotate("density"...), but that won't work when you shift the present plot up by one. Instead of shifting, it fills the whole plot: model.only + annotate("density", x=dens.absents$x, y=dens.absents$y*scaling.factor, fill="red") + annotate("density", x=dens.presents$x, y=dens.presents$y*scaling.factor + 1, fill="green") The only way around that problem is to use a polygon instead of a density geom. One final variant: flipping the top density plot along y-axis = 1: model.only + annotate("polygon", x=dens.absents$x, y=dens.absents$y*scaling.factor, fill="red", colour="black", alpha=0.4) + annotate("polygon", x=dens.presents$x, y=(1 - dens.presents$y*scaling.factor), fill="green", colour="black", alpha=0.4)
I am not sure I get your point, but here an attempt: dat <- rbind(kyphosis,kyphosis) dat$grp <- factor(rep(c('smooth','dens'),each = nrow(kyphosis)), levels = c('smooth','dens')) ggplot(dat,aes(x=Age)) + facet_grid(grp~.,scales = "free_y") + #geom_point(data=subset(dat,grp=='smooth'),aes(y = as.numeric(Kyphosis) - 1)) + stat_smooth(data=subset(dat,grp=='smooth'),aes(y = as.numeric(Kyphosis) - 1), method="glm", family="binomial") + geom_density(data=subset(dat,grp=='dens'))
Adding space between bars in ggplot2
I'd like to add spaces between bars in ggplot2. This page offers one solution: http://www.streamreader.org/stats/questions/6204/how-to-increase-the-space-between-the-bars-in-a-bar-plot-in-ggplot2. Instead of using factor levels for the x-axis groupings, however, this solution creates a numeric sequence, x.seq, to manually place the bars and then scales them using the width() argument. width() doesn't work, however, when I use factor level groupings for the x-axis as in the example, below. library(ggplot2) Treatment <- rep(c('T','C'),each=2) Gender <- rep(c('M','F'),2) Response <- sample(1:100,4) df <- data.frame(Treatment, Gender, Response) hist <- ggplot(df, aes(x=Gender, y=Response, fill=Treatment, stat="identity")) hist + geom_bar(position = "dodge") + scale_y_continuous(limits = c(0, 100), name = "") Does anyone know how to get the same effect as in the linked example, but while using factor level groupings?
Is this what you want? hist + geom_bar(width=0.4, position = position_dodge(width=0.5)) width in geom_bar determines the width of the bar. width in position_dodge determines the position of each bar. Probably you can easily understand their behavior after you play with them for a while.