How to align the bars of a histogram with the x axis? - r

Consider this simple example
library(ggplot2)
dat <- data.frame(number = c(5, 10, 11 ,12,12,12,13,15,15))
ggplot(dat, aes(x = number)) + geom_histogram()
See how the bars are weirdly aligned with the x axis? Why is the first bar on the left of 5.0 while the bar at 10.0 is centered? How can I get control over that? For instance, it would make more sense (to me) to have the bar starting on the right of the label.

Why are the bars "weirdly aligned"?
Let me start by explaining why your code leads to weirdly aligned bars. This has to do with the way a histogram is constructed: first, the x-axis is split up into intervals, and then the number of values in each interval is counted.
By default, ggplot splits the data up into 30 bins. It even spits out a message that says so:
stat_bin() using bins = 30. Pick better value with binwidth.
The default number of bins is not always a good choice. In your case, where all the data points are integers, one might want to choose the boundaries of the bins as 5, 6, 7, 8, ... or 4.5, 5.5, 6.5, ..., such that each bin contains exactly one integer value. You can obtain the boundaries of the bins that have been used in the plot as follows:
data <- data.frame(number = c(5, 10, 11 ,12, 12, 12, 13, 15, 15))
p <- ggplot(data, aes(x = number)) + geom_histogram()
ggplot_build(p)$data[[1]]$xmin
## [1] 4.655172 5.000000 5.344828 5.689655 6.034483 6.379310 6.724138 7.068966 7.413793
## [10] 7.758621 8.103448 8.448276 8.793103 9.137931 9.482759 9.827586 10.172414 10.517241
## [19] 10.862069 11.206897 11.551724 11.896552 12.241379 12.586207 12.931034 13.275862 13.620690
## [28] 13.965517 14.310345 14.655172
As you can see, the boundaries of the bins are not chosen in a way that would lead to a nice alignment of the bars with integers.
So, in short, the reason for the weird alignment is that ggplot simply uses a default number of 30 bins, which is not suitable, in your case, to have bars that are nicely aligned with integers.
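The spacing of the boundaries printed above can be checked by hand. This is a minimal sketch in base R, assuming (inferred from the printed values, not taken from ggplot2's source) that the bin width is the data range divided by the number of bins minus one:

```r
# Data from the question
number <- c(5, 10, 11, 12, 12, 12, 13, 15, 15)

# Assumption inferred from the printed xmin values: width = range / (bins - 1)
bins <- 30
bin_width <- diff(range(number)) / (bins - 1)

# Spacing between the first two printed boundaries (5.000000 - 4.655172)
printed_spacing <- 5.000000 - 4.655172

print(bin_width)
print(printed_spacing)
```

A width of roughly 0.345 cannot line up with integers, which is why some bars sit beside a label while others happen to be centered on one.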
There are (at least) two ways to get nicely aligned bars, which I will discuss in the following:
Use a bar plot instead
Since you have integer data, a histogram may just not be the appropriate choice of visualisation. You could instead use geom_bar(), which will lead to bars that are centered on integers:
ggplot(data, aes(x = number)) + geom_bar() + scale_x_continuous(breaks = 1:16)
You could move the bars to the right of the integers by adding 0.5 to number:
ggplot(data, aes(x = number + 0.5)) + geom_bar() + scale_x_continuous(breaks = 1:16)
Create a histogram with appropriate bins
If you nevertheless want to use a histogram, you can make ggplot use more reasonable bins as follows:
ggplot(data, aes(x = number)) +
geom_histogram(binwidth = 1, boundary = 0, closed = "left") +
scale_x_continuous(breaks = 1:16)
With binwidth = 1, you override the default choice of 30 bins and explicitly require that bins should have a width of 1. boundary = 0 ensures that the binning starts at an integer value, which is what you need, if you want the integers to be to the left of the bars. (If you omit it, bins are chosen such that the bars are centered on integers.)
The argument closed = "left" is a bit more tricky to explain. As I described above, the boundaries of the bins are now chosen to be 5, 6, 7, .... The question is now in which bin a value such as 6 should be. It could be either the first or the second one. This is the choice that is controlled by closed: if you set it to "right" (the default), then the bins are closed on the right, meaning that the right boundary of the bin will be included, while the left boundary belongs to the bin to the left. So, 6 would be in the first bin. On the other hand, if you choose "left", the left boundary will be part of the bin and 6 would be in the second bin.
Since you want the bars to be to the right of the integers, you need to pick closed = "left".
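The effect of the two conventions can be checked without plotting. The sketch below uses base R's cut() and hist(), whose right argument follows the same left-closed / right-closed idea. It is an illustration of the convention, not ggplot2's own binning code.

```r
# Two adjacent bins with boundaries 5, 6, 7
edges <- c(5, 6, 7)

# Right-closed intervals look like (5,6], so 6 belongs to the first bin
bin_right <- as.character(cut(6, breaks = edges, right = TRUE))

# Left-closed intervals look like [6,7), so 6 belongs to the second bin
bin_left <- as.character(cut(6, breaks = edges, right = FALSE))

# Counts for the question's data with integer boundaries 5, 6, ..., 15
number <- c(5, 10, 11, 12, 12, 12, 13, 15, 15)
counts_right <- hist(number, breaks = 5:15, right = TRUE, plot = FALSE)$counts
counts_left <- hist(number, breaks = 5:15, right = FALSE, plot = FALSE)$counts

print(bin_right)
print(bin_left)
print(counts_right)
print(counts_left)
```

With left-closed bins, the three 12s are counted in the bin from 12 to 13, so the bar sits to the right of its label. The two 15s still land in the bin from 14 to 15, because the last bin includes both of its boundaries.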
Comparison of the two solutions
If you compare the histogram with the bar plot, you will notice two differences:
There is a little gap between the bars in the bar plot, while they touch in the histogram. You could make the bars touch in the former by using geom_bar(width = 1).
The rightmost bar is between 15 and 16 for the bar plot, while it is between 14 and 15 for the histogram. The reason is that, while only the left boundary is part of the bin for all other bins, both boundaries are included in the rightmost bin.

This will center the bar on the value
data <- data.frame(number = c(5, 10, 11 ,12,12,12,13,15,15))
ggplot(data,aes(x = number)) + geom_histogram(binwidth = 0.5)
Here is a trick with the tick labels to get the bars aligned on the left.
But if you add other data, you need to shift them as well.
ggplot(data,aes(x = number)) +
geom_histogram(binwidth = 0.5) +
scale_x_continuous(
breaks=seq(0.75,15.75,1), #show x-ticks align on the bar (0.25 before the value, half of the binwidth)
labels = 1:16 #change tick label to get the bar x-value
)
Another option: binwidth = 1, breaks=seq(0.5,15.5,1) (this might make more sense for integers).
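Both sets of tick positions follow the same rule: each tick sits half a bin width before the value it labels. A small sketch of that arithmetic in base R, where left_edge_breaks is a hypothetical helper name and not part of ggplot2:

```r
# Hypothetical helper: tick positions at the left edge of bars that are
# centered on `values` and have width `binwidth`
left_edge_breaks <- function(values, binwidth) {
  values - binwidth / 2
}

values <- 1:16
breaks_half <- left_edge_breaks(values, binwidth = 0.5)  # same as seq(0.75, 15.75, 1)
breaks_one <- left_edge_breaks(values, binwidth = 1)     # same as seq(0.5, 15.5, 1)

print(breaks_half)
print(breaks_one)
```

The result can be passed to scale_x_continuous(breaks = ..., labels = values), so the shift does not have to be retyped when the bin width changes.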

On top of @Stibu's great answer, note that since ggplot2 3.4.0, geom_col and geom_bar can take a new just argument to place the bars / cols to the left or right of the x-axis breaks. 0.5 (the default) will place the columns in the center, 0 on the right, and 1 on the left:
library(patchwork)
library(ggplot2)
plot1 <- ggplot(dat, aes(x = number)) +
geom_bar(just = 0) +
labs(title = "with just = 0") +
scale_x_continuous(breaks = 1:16)
plot2 <- ggplot(dat, aes(x = number)) +
geom_bar(just = 1) +
labs(title = "with just = 1") +
scale_x_continuous(breaks = 1:16)
plot1 + plot2

This worked for me
+ scale_x_continuous(limits = c(0, NA))
From ?scale_x_continuous, limits is:
One of:
NULL to use the default scale range
A numeric vector of length two providing limits of the scale. Use NA to refer to the existing minimum or maximum
A function that accepts the existing (automatic) limits and returns new limits
Note that setting limits on positional scales will remove data outside of the limits. If the purpose is to zoom, use the limit argument in the coordinate system (see coord_cartesian()).

library(ggplot2)
dat <- data.frame(number = c(5, 10, 11 ,12,12,12,13,15,15))
#I have added bins=10 to control too many bins, by default it takes 30
#then it is difficult to read the labels
p1 <- ggplot(dat, aes(x = number)) + geom_histogram(bins = 10, color="black")
#use ggplot_build to get access to bin details, subsetting to [5] is used to
#get max of each bin, you can use 3 to get centre, 4 to get left edge etc
#to see all the components of this chart, you can just run
#ggplot_build(p1)$data[[1]]
binDetails <- round(ggplot_build(p1)$data[[1]][5], digits = 3)
Scalexx <- scale_x_continuous(breaks = binDetails$xmax)
#final chart
p1+Scalexx
The same method is shown as a video at the link below:
https://www.youtube.com/watch?v=Za8bTDvmPLk
With this method, there is no need to work out the bin boundaries manually.

Related

geom_step starting and ending with a horizontal segment

Sometimes I'd like to present data that refer to periods (not to points in time) as a step function. When e.g. data are per-period averages, this seems more appropriate than using a line connecting points (with geom_line). Consider, as an MWE, the following:
df = data.frame(x=1:8,y=rnorm(8,5,2))
ggplot(df,aes(x=x,y=y))+geom_step(size=1)+scale_x_continuous(breaks=seq(0,8,2))
This gives
However, the result is not fully satisfactory, as (1) I'd like the final observation to be represented by a horizontal segment and (2) I'd like to have labels on the x-axis aligned at the center of the horizontal line. What I want can be obtained with some hacking:
library(dplyr)  # provides %>% and mutate()
df %>% rbind(tail(df,1) %>% mutate(x=x+1)) %>%
ggplot(aes(x,y))+geom_step(size=1)+
scale_x_continuous(breaks=seq(0,12,2))+
theme(axis.ticks.x=element_blank(),axis.text.x=element_text(hjust=-2))
which produces:
This corresponds to what I am looking for (except that the horizontal alignment of labels requires some fine tuning and is not perfect). However, I am not sure this is the best way to proceed and I wonder if there is a better way.
Does this work for you? It comes down to altering the data as it is passed rather than changing the plotting code per se (as is often the case in ggplot)
Essentially what we do is add an extra copy of the final y value on to the end of the data frame at an incremented x value.
To make the horizontal segments line up to the major axis breaks, we simply subtract 0.5 from the x value.
ggplot(rbind(df, data.frame(x = 9, y = tail(df$y, 1))),
aes(x = x - 0.5, y = y)) +
geom_step(size = 1)+
scale_x_continuous(breaks = seq(0, 8, 2), name = "x",
minor_breaks = seq(0, 8, 1) + 0.5) +
theme_bw() +
theme(panel.grid.major.x = element_blank(),
panel.grid.minor = element_line())
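The data preparation in this answer can be separated from the plotting code. The sketch below is one way to do it in base R. The helper name extend_for_step and the sample y values are made up for illustration, and the period length is assumed to be constant.

```r
# Hypothetical helper: repeat the last observation one period later,
# then shift x by half a period so each segment is centered on its x value
extend_for_step <- function(df, period = 1) {
  last_row <- df[nrow(df), , drop = FALSE]
  last_row$x <- last_row$x + period
  out <- rbind(df, last_row)
  out$x <- out$x - period / 2
  rownames(out) <- NULL
  out
}

# Illustrative data (fixed values instead of rnorm() so the result is reproducible)
df <- data.frame(x = 1:8, y = c(5.1, 4.2, 6.3, 3.8, 5.5, 7.0, 4.9, 6.1))
step_df <- extend_for_step(df)
print(step_df)
```

The result can then be plotted with ggplot(step_df, aes(x, y)) + geom_step(), without hard-coding x = 9 for the extra row.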

Why is the histogram from `ggplot()` the same as the one for only one value of the variable used for `aes` `fill`?

The question is about two observations related to following 3 figures:
(1) Why are the histograms in (a) and (b) different if the number of bins is the same?
(2) The histogram in (b) is exactly the same as the histogram for the fill nonsmo. If this is the case, how can I make a histogram of the complete data using ggplot()?
(a) Plot using hist(chol$AGE,30).
(b) Histogram plotted with ggplot(data=chol, aes(chol$AGE)) + geom_histogram() and default values i.e. 30 bins.
(c) Now adding fill with respect to the variable SMOKE:
ggplot(data=chol, aes(chol$AGE)) +
geom_histogram(aes(fill = chol$SMOKE))
Here is what I did after comments by @Dave2e
ggplot(data=chol, aes(AGE, fill = SMOKE)) +
geom_histogram(aes(y = ..count..), binwidth = 1, position = "stack")
hist(chol$AGE, breaks = 30, right = FALSE)
After setting the correct value for binwidth, realizing that the default position is stack, and using right = FALSE, I got exactly the same histograms.
Most likely, a large number of values coincide with the bins' upper and lower limits, so depending on whether the bins are left-open or right-open, there can be a significant shift between bins.
For example compare:
set.seed(10)
age<-as.integer(rnorm(100, 50, 20))
par(mfrow=c(2, 1))
hist(age, 30, right=TRUE)
hist(age, 30, right=FALSE)
Notice, only about 18 bins were created (bin width of 5)
With ggplot2, where the bins are shifted to the center of the bin range:
library(ggplot2)
ggplot(data.frame(age), aes(age)) +geom_histogram()
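One reason the two functions disagree is that base R's hist() treats a single number passed as breaks as a suggestion only and picks "pretty" breakpoints instead. The sketch below shows this with made-up integer ages rather than the chol data:

```r
# Illustrative data: integer ages from 0 to 100 (not the chol data set)
age <- seq(0, 100, by = 1)

# Ask for 30 bins; hist() chooses "pretty" breakpoints instead
h <- hist(age, breaks = 30, plot = FALSE)

n_bins <- length(h$counts)            # number of bins actually used
bin_widths <- unique(diff(h$breaks))  # one common width for all bins

print(n_bins)
print(bin_widths)
```

To compare hist() with geom_histogram() bin by bin, it is safer to pass an explicit vector of breakpoints to hist() and a matching binwidth and boundary to geom_histogram().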

Realigning x-axis on geom_histogram in ggplot2

When creating a geom_histogram in ggplot, the bin labels appear directly underneath the bars. How can I make it so that they appear on either side of the bin, so that they describe the range of each bin (so that the bin that includes cases from 0 to 10 will appear between the 0 and 10 labels)?
I tried using
geom_histogram(position=position_nudge(5))
However, the histogram I'm using is stacked (to differentiate categories within each bin), and this effect is ruined when I add this position. Is there another way of doing it? Maybe moving the axis labels themselves instead of the bars?
Reproducible code:
dd<-data.frame(nums=c(1:20,15:30,40:55),cats=c(rep("a",20),rep("b",30),rep("c",2)))
ggplot(dd, aes(nums))+geom_histogram(aes(nums,fill=cats),dd,binwidth = 10)
results in this:
I want the bars to be shifted to the right by 5, so that the 0 aligns with the left-hand side of the histogram
You can try to define breaks and labels
n <- 10
ggplot(dd, aes(nums, fill=cats)) +
geom_histogram(binwidth = n, boundary = 0) +
scale_x_continuous(breaks = seq(0,55,n), labels = seq(0,55, n))
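Note that seq(0,55,n) stops at 50, so the right edge of the last bin (60) gets no label. A small sketch of how the breaks could be derived from the data instead, assuming the bins start at a multiple of n as they do with boundary = 0:

```r
# Data from the question
nums <- c(1:20, 15:30, 40:55)
n <- 10

# Round the data range outwards to multiples of the bin width
axis_breaks <- seq(floor(min(nums) / n) * n,
                   ceiling(max(nums) / n) * n,
                   by = n)
print(axis_breaks)
```

Passing axis_breaks to scale_x_continuous(breaks = axis_breaks) labels both edges of every bin, including the last one.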
The following moves the labels of the axis. I wasn't sure how to move the ticks on the x axis so I removed them.
ggplot(dd, aes(nums))+geom_histogram(aes(nums),dd,binwidth = 10)+
theme(axis.text.x = element_text(hjust = 5),
axis.ticks.x = element_blank())

ggrepel: Repelling text in only one direction, and returning values of repelled text

I have a dataset, where each data point has an x-value that is constrained (represents an actual instance of a quantitative variable), y-value that is arbitrary (exists simply to provide a dimension to spread out text), and a label. My datasets can be very large, and there is often text overlap, even when I try to spread the data across the y-axis as much as possible.
Hence, I am trying to use the new ggrepel. However, I am trying to keep the text labels constrained at their x-value position, while only allowing them to repel from each other in the y-direction.
As an example, the below code produces a plot for 32 data points, where the x-values show the number of cylinders in a car, and the y-values are determined randomly (have no meaning but to provide a second dimension for text plotting purposes). Without using ggrepel, there is significant overlap in the text:
library(ggrepel)
library(ggplot2)
set.seed(1)
data = data.frame(x=runif(100, 1, 10),y=runif(100, 1, 10),label=paste0("label",seq(1:100)))
origPlot <- ggplot(data) +
geom_point(aes(x, y), color = 'red') +
geom_text(aes(x, y, label = label)) +
theme_classic(base_size = 16)
I can remedy the text overlap using ggrepel, as shown below. However, this changes not only the y-values, but also the x-values. I am trying to avoid changing the x-values, as they represent an actual physical meaning (the number of cylinders):
repelPlot <- ggplot(data) +
geom_point(aes(x, y), color = 'red') +
geom_text_repel(aes(x, y, label = label)) +
theme_classic(base_size = 16)
As a note, the reason I cannot allow the x-value of the text to change is because I am only plotting the text (not the points). Whereas, it seems that most examples in ggrepel keep the position of the points (so that their values remain true), and only repel the x and y values of the labels. Then, the points are connected to the labels with segments (you can see that in my second plot example).
I kept the points in the two examples above for demonstration purposes. However, I am only retaining the text (and hence will be removing the points and the segments), leaving me with something like this:
repelPlot2 <- ggplot(data) + geom_text_repel(aes(x, y, label = label), segment.size = 0) + theme_classic(base_size = 16)
My question is twofold:
1) Is it possible for me to repel the text labels only in the y-direction?
2) Is it possible for me to obtain a structure containing the new (repelled) y-values of the text?
Thank you for any advice!
ggrepel version 0.6.8 (install from GitHub using devtools::install_github) now supports a "direction" argument, which enables repelling of labels only in "x" or "y" direction.
repelPlot2 <- ggplot(data) + geom_text_repel(aes(x, y, label = label), segment.size = 0, direction = "y") + theme_classic(base_size = 16)
Getting the y values is harder -- one approach can be to use the "repel_boxes" function from ggrepel first to get repelled values and then input those into ggplot with geom_text. For discussion and sample code of that approach, see https://github.com/slowkow/ggrepel/issues/24. Note that if using the latest version, the repel_boxes function now also has a "direction" argument, which takes in "both","x", or "y".
I don't think it is possible to repel text labels only in one direction with ggrepel.
I would approach this problem differently, by instead generating the arbitrary y-axis positions manually. For example, for the data set in your example, you could do this using the code below.
I have used the dplyr package to group the data set by the values of x, and then created a new column of data y containing the row numbers within each group. The row numbers are then used as the values for the y-axis.
library(ggplot2)
library(dplyr)
data <- data.frame(x = mtcars$cyl, label = paste0("label", seq(1:32)))
data <- data %>%
group_by(x) %>%
mutate(y = row_number())
ggplot(data, aes(x = x, y = y, label = label)) +
geom_text(size = 2) +
xlim(3.5, 8.5) +
theme_classic(base_size = 8)
ggsave("filename.png", width = 4, height = 2)
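The same within-group row numbers can also be computed without dplyr. This is a base R alternative to the group_by() / row_number() step above, not the code used in this answer:

```r
# x values as in the answer: number of cylinders per car
x <- mtcars$cyl

# Running index within each group of identical x values
y <- ave(seq_along(x), x, FUN = seq_along)

# Height of each stack of labels, one entry per distinct x
stack_heights <- tapply(y, x, max)
print(stack_heights)
```

Each label then has a unique (x, y) position, so labels that share an x value are stacked instead of overlapping.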

geom_text() size definitions in ggplot2

I'm trying to vary the size of a geom_text() layer in a ggplot so that the labels are always narrower than a given range. The ranges are defined in the data, but what I don't know is how to scale the label to be narrower than that, without a ton of trial and error.
What I hope is that I can construct a function of label size and nchar(label) (realizing character width varies a bit) that would return a width that I could compare to the shape width, and scale down until no longer necessary.
Are the ggplot label sizes defined as a number of pixels, percentage of the plot height, or something else like that?
Would this be a helpful place to start? (If not, please feel free to delete my post.) You add your ranges to ranges = rnorm(foo, 5, 1).
library(ggplot2)
library(directlabels)
set.seed(67)
foo <- 8
df <- data.frame(x = rnorm(foo, 1, .5), y=rnorm(foo, 1, .5), ranges = rnorm(foo, 5, 1), let=letters[1:foo])
p <- ggplot(df, aes(x, y, color=let)) + geom_point() + scale_colour_brewer(palette=5)
direct.label(p,
list("top.points", rot=0, cex=df[,3],
fontface="bold", fontfamily="serif", alpha=0.8))
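On the question of units: as far as I know, ggplot2 measures the size of text in millimetres, not in pixels or as a fraction of the plot, and it exports the constant .pt (72.27 / 25.4) to convert between millimetres and points. The sketch below reproduces that conversion in base R. The helper names are made up, and 3.88 is assumed to be the geom_text() default size.

```r
# Points per millimetre (72.27 pt per inch / 25.4 mm per inch);
# ggplot2 exposes the same ratio as `.pt`
pt_per_mm <- 72.27 / 25.4

# Hypothetical helpers for converting text sizes
mm_to_pt <- function(size_mm) size_mm * pt_per_mm
pt_to_mm <- function(size_pt) size_pt / pt_per_mm

# Assumed geom_text() default of 3.88 mm, which is close to 11 pt
default_size_pt <- mm_to_pt(3.88)
print(default_size_pt)
```

Because the size is absolute, the width of a label relative to the data range still depends on the size of the output device, so some calibration against the final figure dimensions is likely to remain necessary.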

Resources