ggplot: why is the y-scale larger than the actual values for each response? - r

Likely a dumb question, but I cannot seem to find a solution: I am trying to graph a categorical variable on the x-axis (3 groups) and a continuous variable (a percentage from 0 to 100) on the y-axis. To do so, I have to specify stat = "identity" in geom_bar or use geom_col.
However, the bars still reach about 4000 on the y-axis, even after following the comments from Y-scale issue in ggplot and from Why is the value of y bar larger than the actual range of y in stacked bar plot?.
Here is how the graph keeps coming out:
I also double checked that the x variable is a factor and the y variable is numeric. Why would this still be coming out at 4000 instead of 100, like a percentage?
EDIT:
The y-values are simply responses from participants. I have a large dataset (N = 600), and each y-value is a percentage from 0 to 100 given by one participant. So, in each group (N = 200 per group), I have one percentage per participant. I want to visually compare the three groups based on the percentages they gave.
This is the code I used to plot the graph.
df$group <- as.factor(df$group)
df$confid<- as.numeric(df$confid)
library(ggplot2)
plot <-ggplot(df, aes(group, confid))+
geom_col()+
ylab("confid %") +
xlab("group")

Are you perhaps trying to plot the mean percentage in each group? Otherwise, it is not clear how a bar plot could easily represent what you are looking for. You could perhaps add error bars to give an idea of the spread of responses.
Suppose your data looks like this:
set.seed(4)
df <- data.frame(group = factor(rep(1:3, each = 200)),
confid = sample(40, 600, TRUE))
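Using your plotting code, we get very similar results to yours, because geom_col stacks the 200 individual responses in each group, so each bar's height is the sum of that group's values: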
library(ggplot2)
plot <-ggplot(df, aes(group, confid))+
geom_col()+
ylab("confid %") +
xlab("group")
plot
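The bar heights near 4000 can be checked without plotting. This is a minimal sketch, assuming the simulated df from this answer (not the asker's real data): geom_col() uses position = "stack" by default, so the rows in each group are piled on top of each other and the bar ends at the group sum rather than at any single response or at the mean.

```r
# Simulated data from the answer above (not the asker's real data)
set.seed(4)
df <- data.frame(group = factor(rep(1:3, each = 200)),
                 confid = sample(40, 600, TRUE))

# geom_col() stacks every row in a group, so each bar is as tall as the group sum
group_sums <- tapply(df$confid, df$group, sum)

# what a per-group summary bar would show instead
group_means <- tapply(df$confid, df$group, mean)

print(group_sums)
print(group_means)
```

With 200 responses per group, each sum is 200 times the corresponding mean, which is why the axis runs into the thousands.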
However, if we use stat_summary, we can instead plot the mean and standard error for each group:
ggplot(df, aes(group, confid)) +
stat_summary(geom = "bar", fun = mean, width = 0.6,
fill = "deepskyblue", color = "gray50") +
geom_errorbar(stat = "summary", width = 0.5) +
geom_point(stat = "summary") +
ylab("confid %") +
xlab("group")
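When no summary function is supplied, stat = "summary" falls back to mean_se(), so the error bars above span the mean plus and minus one standard error. A minimal base-R sketch of that calculation, assuming the same simulated df; the helper name se is only for illustration:

```r
set.seed(4)
df <- data.frame(group = factor(rep(1:3, each = 200)),
                 confid = sample(40, 600, TRUE))

# standard error of the mean: sd / sqrt(n) (illustrative helper, not from ggplot2)
se <- function(x) sd(x) / sqrt(length(x))

summ <- data.frame(
  group = levels(df$group),
  mean  = as.numeric(tapply(df$confid, df$group, mean)),
  se    = as.numeric(tapply(df$confid, df$group, se))
)
summ$ymin <- summ$mean - summ$se   # lower end of the error bar
summ$ymax <- summ$mean + summ$se   # upper end of the error bar
print(summ)
```

Precomputing a table like this and plotting it with geom_col() plus geom_errorbar() is an alternative to stat_summary when the summary values are needed elsewhere.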

Related

Why does specifying fill in an Aesthetic mapping change the figure in the plot

When trying to highlight a part of a plot, I got an output I didn't expect.
This is the code I'm using to plot the density function of student grades from my dataset.
grades <- student_data$G3
q_aprox = function(x) return (qnorm(x, mean(grades), sd(grades)))
ggplot(student_data, aes(x = G3)) +
# -- IMPORTANT PART BEGIN -- #
geom_density(
color = 'steelblue',
alpha = 0.3,
position = 'stack'
) +
geom_density(
aes(fill = q_aprox(0.025) < G3 & G3 < q_aprox(0.975)),
alpha = 0.3,
position = 'stack'
) + theme_minimal()
# -- IMPORTANT PART END -- #
Unexpectedly, the plot I got from the first geom_density is different than the one I got from the second geom_density. I expected that, since the x and y mappings are left untouched, the plots would be the same.
Why doesn't this happen?
grades, or student_data$G3, is a numeric vector of size 395 with discrete values from 0 up to 20.
Here's the plot that's produced from the previous code
Output Plot - Not enough reputation to post images, sorry
The left tail on the second call is bigger than the one on the first. Also, the output in general seems to be "more spiked".
I recently watched part 1 of ggplot2's workshop on YouTube in preparation for this college assignment. That's more or less my knowledge level regarding ggplot2.
Specifying the fill aesthetic with a discrete variable triggers heuristics in ggplot2 that automatically group your data. The densities are calculated per group, and each density integrates to 1. Therefore, if you calculate densities for two groups of unequal sizes, each one still integrates to 1, so the area under each density does not reflect the unequal group sizes.
Below is an example of two groups, wherein group A is 10x as large as group B and the groups have different means. You'll notice that if we don't group the data, the resulting density peaks at -1: the center/mean of group A. However, when we auto-group the data with the fill aesthetic, both densities peak at their own means, but the area of group B is as large as that of group A (it continues behind the blue/green density).
library(ggplot2)
library(patchwork)
df <- data.frame(
x = c(rnorm(1000, -1), rnorm(100, 1)),
group = rep(c("A", "B"), c(1000, 100))
)
g1 <- ggplot(df, aes(x)) +
geom_density()
g2 <- ggplot(df, aes(x, fill = group)) +
geom_density()
g1 | g2
If you want to retain proportions to the group sizes, you can use y = after_stat(count) to use the computed variable count, which is the density estimate (which integrates to 1) times the number of observations. You can read about computed variables in the documentation under the header "computed variables" in, for example, ?geom_density.
ggplot(df, aes(x, fill = group)) +
geom_density(aes(y = after_stat(count)))
Created on 2021-05-12 by the reprex package (v0.3.0)
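The "integrates to 1" point can be checked numerically. This is a rough sketch that uses base R's density() as a stand-in for what geom_density computes (an assumption, although the estimates should be comparable); the helper name area is only for illustration.

```r
set.seed(1)
# same shape as the example above: group A is 10x as large as group B
x_a <- rnorm(1000, -1)
x_b <- rnorm(100, 1)

# area under a kernel density estimate (Riemann sum over its evenly spaced grid)
area <- function(d) sum(d$y) * diff(d$x[1:2])

d_a <- density(x_a)
d_b <- density(x_b)

area(d_a)                  # close to 1
area(d_b)                  # also close to 1, despite 10x fewer observations

# scaling by the number of observations restores the group sizes,
# which is what the computed variable `count` does
area(d_a) * length(x_a)    # close to 1000
area(d_b) * length(x_b)    # close to 100
```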

R-Programming - ggplot2 - boxplot issues (varwidth & position_dodge / stat_summary & position_dodge)

I am currently using ggplot2 to display some feature distributions with boxplots.
I can produce some simple boxplots, changing color, form, etc. but I cannot achieve the ones that combine several options.
1°)
My goal is to display side-by-side boxplots for men and for women, which can be done with position = position_dodge(width=0.9).
I also want the width of each boxplot to be proportional to the size of the sample, which can be done with varwidth=TRUE.
First problem: when I combine the two options, it does not work and I get the following message:
position_dodge requires non-overlapping x intervals
Boxplot when using varwidth=TRUE and position_dodge together:
I have tried changing the size of the plot, but it did not help. If I drop varwidth=TRUE, the boxplots are dodged correctly.
Is there a way around this, or is it a limitation of ggplot2?
2°)
Besides, I want to display the size of each sample used to build the boxplots.
I can get the count with stat_summary(fun.data = give.n, ...), but I have not found a way to keep the numbers from overlapping when the boxplots are at similar positions.
I tried using hjust and vjust to move the numbers, but they seem to share the same origin, so that does not help.
Overlapping numbers produced by stat_summary when the boxplots are dodged:
As there are no labels in the data, I could not use geom_text, or at least I have not found a way to pass the computed stat to geom_text.
So the second problem is: how can I neatly display each number on its own boxplot?
Here is my code:
library(ggplot2)
# return the label position (the median) and the sample size for each box
give.n <- function(x){
  return(c(y = median(x), label = length(x)))
}
plot_boxes <- function(mydf, mycolumn1, mycolumn2) {
  # axis titles are taken from the expressions passed by the caller
  mylegendx <- deparse(substitute(mycolumn1))
  mylegendy <- deparse(substitute(mycolumn2))
  g2 <- ggplot(mydf, aes(x=as.factor(mycolumn1), y=mycolumn2, color=Gender,
                         fill=Gender)) +
    geom_boxplot(data=mydf, aes(x=as.factor(mycolumn1), y=mycolumn2,
                 color=Gender), position=position_dodge(width=0.9), alpha=0.3) +
    stat_summary(fun.data = give.n, geom = "text", size = 3, vjust=1) +
    theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
    scale_x_discrete(name = mylegendx) +
    labs(title=paste("Boxplot ", substring(mylegendy, 11), " by ",
                     substring(mylegendx, 11)), x = mylegendx, y = mylegendy)
  print(g2)
}
#setwd("~/data")
filename <- "df_stackoverflow.csv"
df_client <- read.csv(file=filename, header=TRUE, sep=";", dec=".")
plot_boxes(df_client, df_client$Client.Class, df_client$nbyears_client)
And the data looks like this (small sample from the dataset - 20,000 lines):
Client.Id;Client.Status;Client.Class;Gender;nbyears_client
3;Active;Middle Class;Male;1.38
4;Active;Middle Class;Male;0.9
5;Active;Retiree;Female;0.21
6;Active;Middle Class;Male;0.9
7;Active;Middle Class;Male;3.55
8;Active;Subprime;Male;1.16
9;Active;Middle Class;Male;1.21
10;Active;Part-time;Male;3.38
17;Active;Middle Class;Male;1.83
19;Active;Subprime;Female;5.81
20;Active;Farming;Male;8.99
21;Active;Subprime;Female;6.49
22;Active;Middle Class;Male;1.54
23;Active;Middle Class;Female;2.74
24;Active;Subprime;Male;0.46
25;Active;Executive;Female;0.49
26;Active;Middle Class;Female;3.55
27;Active;Middle Class;Male;3.83
29;Active;Subprime;Female;2.66
30;Active;Middle Class;Male;2.72
31;Active;Middle Class;Female;4.88
32;Active;Subprime;Male;1.46
34;Active;Middle Class;Female;7.16
41;Active;Middle Class;Male;0.65
44;Active;Middle Class;Male;2
45;Active;Subprime;Male;1.13
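
One possible approach to the second problem, offered as a sketch rather than a confirmed fix: give the text layer the same dodge as the boxes by passing position = position_dodge(width = 0.9) to stat_summary(). The data frame demo below is synthetic (made-up values, with column names borrowed from the sample above), since the full CSV is not available, and give.n is rewritten to return a data frame.

```r
library(ggplot2)

# Synthetic stand-in for the real data (values are made up for illustration)
set.seed(42)
demo <- data.frame(
  Client.Class   = rep(c("Middle Class", "Subprime"), each = 40),
  Gender         = rep(c("Male", "Female"), times = 40),
  nbyears_client = round(rexp(80, rate = 0.4), 2)
)

# label position (the median) and sample size for each box
give.n <- function(x) {
  data.frame(y = median(x), label = length(x))
}

p <- ggplot(demo, aes(x = Client.Class, y = nbyears_client,
                      color = Gender, fill = Gender)) +
  geom_boxplot(position = position_dodge(width = 0.9), alpha = 0.3) +
  # same dodge width as the boxes, so each count sits on its own box
  stat_summary(fun.data = give.n, geom = "text", size = 3, vjust = -0.5,
               position = position_dodge(width = 0.9), show.legend = FALSE)
print(p)
```

For the first problem, recent ggplot2 versions document position_dodge2() as a variant of position_dodge() meant for boxplots with variable widths, so combining it with varwidth = TRUE may be worth trying; that combination is not tested here.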

How to change origin line position in ggplot bar graph?

Say I'm measuring 10 personality traits and I know the population baseline. I would like to create a chart for individual test-takers to show them their individual percentile ranking on each trait. Thus, the numbers go from 1 (percentile) to 99 (percentile). Given that a 50 is perfectly average, I'd like the graph to show bars going to the left or right from 50 as the origin line. In bar graphs in ggplot, it seems that the origin line defaults to 0. Is there a way to change the origin line to be at 50?
Here's some fake data and default graphing:
df <- data.frame(
names = LETTERS[1:10],
factor = round(rnorm(10, mean = 50, sd = 20), 1)
)
library(ggplot2)
ggplot(data = df, aes(x=names, y=factor)) +
geom_bar(stat="identity") +
coord_flip()
Picking up on @nongkrong's comment, here's some code that will do what I think you want while relabeling the ticks to match the original range and relabeling the axis to avoid showing the math:
library(ggplot2)
ggplot(data = df, aes(x=names, y=factor - 50)) +
geom_bar(stat="identity") +
scale_y_continuous(breaks=seq(-50,50,10), labels=seq(0,100,10)) + ylab("Percentile") +
coord_flip()
This post was really helpful for me - thanks @ulfelder and @nongkrong. However, I wanted to re-use the code on different data without having to manually adjust the tick labels to fit the new data. To do this in a way that retains ggplot's tick placement, I defined a tiny function and passed it to the labels argument:
fix.labels <- function(x){
x + 50
}
ggplot(data = df, aes(x=names, y=factor - 50)) +
geom_bar(stat="identity") +
scale_y_continuous(labels = fix.labels) + ylab("Percentile") +
coord_flip()
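The same idea extends to any baseline. A small sketch, where offset_labels is a hypothetical helper name (not part of ggplot2): it returns a labelling function that adds the baseline back, so the subtraction in aes() and the relabelling of the ticks stay in sync.

```r
# hypothetical helper: build a labeller that undoes the `y - baseline` shift
offset_labels <- function(baseline) {
  function(x) x + baseline
}

baseline <- 50
relabel <- offset_labels(baseline)

# tick positions on the shifted scale map back to the original percentiles
relabel(c(-50, -25, 0, 25, 50))   # 0 25 50 75 100
```

In the plot this would be used as aes(y = factor - baseline) together with scale_y_continuous(labels = offset_labels(baseline)).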

ggplot2 histogram of factors showing the probability mass instead of count

I am trying to use the excellent ggplot2 with the bar geom to plot the probability mass rather than the count. However, with aes(y=..density..) the distribution does not sum to one (although it is close). I think the problem might be due to the default bin width for factors. Here is an example of the problem:
age <- c(rep(0,4), rep(1,4))
mppf <- c(1,1,1,0,1,1,0,0)
data.test <- as.data.frame(cbind(age,mppf))
data.test$age <- as.factor(data.test$age)
data.test$mppf <- as.factor(data.test$mppf)
p.test.density <- ggplot(data.test, aes(mppf, group=age, fill=age)) +
geom_bar(aes(y=..density..), position='dodge') +
scale_y_continuous(limits=c(0,1))
dev.new()
print(p.test.density)
I can get around this problem by keeping the x-variable as continuous and setting binwidth=1, but it doesn't seem very elegant.
data.test$mppf.numeric <- as.numeric(data.test$mppf)
p.test.density.numeric <- ggplot(data.test, aes(mppf.numeric, group=age, fill=age)) +
geom_histogram(aes(y=..density..), position='dodge', binwidth=1)+
scale_y_continuous(limits=c(0,1))
dev.new()
print(p.test.density.numeric)
I think you almost have it figured out, and would have once you realized you needed a bar plot and not a histogram.
The default width for bars with categorical data is .9 (See ?stat_bin. The help page for geom_bar doesn't give the default bar width but does send you to stat_bin for further reading.). Given that, your plots show the correct density for a bar width of .9. Simply change to a width of 1 and you will see the density values you expected to see.
ggplot(data.test, aes(x = mppf, group = age, fill = age)) +
geom_bar(aes(y=..density..), position = "dodge", width = 1) +
scale_y_continuous(limits=c(0,1))
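In current ggplot2 releases geom_bar() is backed by stat_count(), which exposes a computed variable prop (the within-group proportion). A sketch under that assumption (ggplot2 3.3.0 or later for after_stat()), which should give bars that sum to 1 within each age group regardless of the bar width:

```r
library(ggplot2)

# same toy data as in the question
data.test <- data.frame(
  age  = factor(c(rep(0, 4), rep(1, 4))),
  mppf = factor(c(1, 1, 1, 0, 1, 1, 0, 0))
)

p <- ggplot(data.test, aes(x = mppf, group = age, fill = age)) +
  # prop = count / sum(count) within each group, i.e. the probability mass
  geom_bar(aes(y = after_stat(prop)), position = "dodge") +
  scale_y_continuous(limits = c(0, 1))
print(p)

# the computed proportions sum to 1 per group
built <- ggplot_build(p)$data[[1]]
tapply(built$prop, built$group, sum)
```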

Overlay raw data onto geom_bar

I have a data-frame arranged as follows:
condition,treatment,value
A , one , 2
A , one , 1
A , two , 4
A , two , 2
...
D , two , 3
I have used ggplot2 to make a grouped bar plot that looks like this:
The bars are grouped by "condition" and the colours indicate "treatment." The bar heights are the mean of the values for each condition/treatment pair. I achieved this by creating a new data frame containing the mean and standard error (for the error bars) for all the points that will make up each group.
What I would like to do is superimpose the raw jittered data to produce a bar-chart version of this box plot: http://docs.ggplot2.org/0.9.3.1/geom_boxplot-6.png [I realise that a box plot would probably be better, but my hands are tied because the client is pathologically attached to bar charts]
I have tried adding a geom_point object to my plot and feeding it the raw data (rather than the aggregated means which were used to make the bars). This sort of works, but it plots the raw values at the wrong x axis locations. They appear at the points at which the red and grey bars join, rather than at the centres of the appropriate bar. So my plot looks like this:
I can not figure out how to shift the points by a fixed amount and then jitter them in order to get them centered over the correct bar. Anyone know? Is there, perhaps, a better way of achieving what I'm trying to do?
What follows is a minimal example that shows the problem I have:
#Make some fake data
ex=data.frame(cond=rep(c('a','b','c','d'),each=8),
treat=rep(rep(c('one','two'),4),each=4),
value=rnorm(32) + rep(c(3,1,4,2),each=4) )
#Calculate the mean and SD of each condition/treatment pair
agg=aggregate(value~cond*treat, data=ex, FUN="mean") #mean
agg$sd=aggregate(value~cond*treat, data=ex, FUN="sd")$value #add the SD
dodge <- position_dodge(width=0.9)
limits <- aes(ymax=value+sd, ymin=value-sd) #Set up the error bars
p <- ggplot(agg, aes(fill=treat, y=value, x=cond))
#Plot, attempting to overlay the raw data
print(
p + geom_bar(position=dodge, stat="identity") +
geom_errorbar(limits, position=dodge, width=0.25) +
geom_point(data= ex[ex$treat=='one',], colour="green", size=3) +
geom_point(data= ex[ex$treat=='two',], colour="pink", size=3)
)
I found it is unnecessary to create separate dataframes. The plot can be created by providing ggplot with the raw data.
ex <- data.frame(cond=rep(c('a','b','c','d'),each=8),
treat=rep(rep(c('one','two'),4),each=4),
value=rnorm(32) + rep(c(3,1,4,2),each=4) )
p <- ggplot(ex, aes(cond,value,fill = treat))
p + geom_bar(position = 'dodge', stat = 'summary', fun = 'mean') +
geom_errorbar(stat = 'summary', position = 'dodge', width = 0.9) +
geom_point(aes(x = cond), shape = 21, position = position_dodge(width = 1))
You need just one call to geom_point(), where you use data frame ex and set the x values to cond, the y values to value, and color=treat (inside aes()). Then add position=dodge to ensure that the points are dodged. With scale_color_manual() and its values= argument you can set the colors you need.
p+geom_bar(position=dodge, stat="identity") +
geom_errorbar(limits, position=dodge, width=0.25)+
geom_point(data=ex,aes(cond,value,color=treat),position=dodge)+
scale_color_manual(values=c("green","pink"))
UPDATE - jittering of points
You can't directly use positions dodge and jitter together, but there are some workarounds. If you save the whole plot as an object, then with ggplot_build() you can see the x positions of the bars; in this case they are 0.775, 1.225, 1.775... Those positions correspond to the combinations of the factors cond and treat. As there are 4 values for each combination in data frame ex, add a new column that contains those x positions, each repeated 4 times.
ex$xcord<-rep(c(0.775,1.225,1.775,2.225,2.775,3.225,3.775,4.225),each=4)
Now in geom_point() use this new column as x values and set position to jitter.
p+geom_bar(position=dodge, stat="identity") +
geom_errorbar(limits, position=dodge, width=0.25)+
geom_point(data=ex,aes(xcord,value,color=treat),position=position_jitter(width =.15))+
scale_color_manual(values=c("green","pink"))
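The hard-coded x positions can also be derived instead of read off ggplot_build(). A minimal sketch, assuming two treat levels and the dodge width of 0.9 used above: each dodged bar is 0.9 / 2 = 0.45 wide, so the bar centres sit 0.225 to either side of the integer position of cond.

```r
# fake data with the same layout as in the question
set.seed(1)
ex <- data.frame(cond  = rep(c("a", "b", "c", "d"), each = 8),
                 treat = rep(rep(c("one", "two"), 4), each = 4),
                 value = rnorm(32) + rep(c(3, 1, 4, 2), each = 4))

dodge_width <- 0.9                           # same width as position_dodge() above
n_groups    <- length(unique(ex$treat))      # bars per x position (2 here)

# integer codes: cond a..d -> 1..4, treat one/two -> 1/2
cond_idx  <- as.integer(factor(ex$cond))
treat_idx <- as.integer(factor(ex$treat))

# centre of each dodged bar: shift from the cond position by a
# multiple of the individual bar width
ex$xcord <- cond_idx + (treat_idx - (n_groups + 1) / 2) * dodge_width / n_groups

unique(ex$xcord)
```

Deriving the column this way avoids relying on the row order of ex, which the rep(..., each=4) version depends on.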
As illustrated by holmrenser above, referencing a single dataframe and updating the stat instruction to "summary" in the geom_bar function is more efficient than creating additional dataframes and retaining the stat instruction as "identity" in the code.
To both jitter and dodge the data points with the bar charts per the OP's original question, this can also be accomplished by updating the position instruction in the code with position_jitterdodge. This positioning scheme allows widths for jitter and dodge terms to be customized independently, as follows:
p <- ggplot(ex, aes(cond,value,fill = treat))
p + geom_bar(position = 'dodge', stat = 'summary', fun = 'mean') +
geom_errorbar(stat = 'summary', position = 'dodge', width = 0.9) +
geom_point(aes(x = cond), shape = 21, position =
position_jitterdodge(jitter.width = 0.5, jitter.height=0.4,
dodge.width=0.9))
