Spacing between groups of bars in histogram - r

When I produce histograms in ggplot2 where the bar positions are dodged, I expect something like this, with space between the groups of bars (i.e. notice the white space between each group of red/green pairs):
I'm having a hard time producing the same effect when I build a histogram with continuous data. I can't seem to add space between the groups of bars, and instead, everything gets squashed together. As you can see, it makes it visually difficult to compare the red/green pairs:
To reproduce my problem, I created a sample data set here: https://www.dropbox.com/s/i9nxzo1cmbwwfsa/data.csv?dl=0
Code to reproduce:
library(ggplot2)
data <- read.csv("https://www.dropbox.com/s/i9nxzo1cmbwwfsa/data.csv?dl=1")
ggplot(data, aes(x = soldPrice, fill = month)) +
geom_histogram(binwidth=1e5, position=position_dodge()) +
labs(x="Sold Price", y="Sales", fill="") +
scale_x_continuous(labels=scales::comma, breaks=seq(0, 2e6, by = 1e5)) +
theme_bw() +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
How can I add white space between the groups of red/green pairs?

Alternative 1: overlapping bars with geom_histogram()
From ?position_dodge():
Dodging preserves the vertical position of an geom while adjusting the horizontal position
This function accepts a width argument that determines the space to be created.
To get what I think you want, you need to supply a suitable value to position_dodge(). In your case, where binwidth=1e5, you might try a value about 20% smaller than the bin width, i.e. 80% of it: position=position_dodge(1e5-20*(1e3)).
(I left the rest of your code untouched.)
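As a rough sketch of the arithmetic above (the names gap_fraction and dodge_width are my own, not part of ggplot2), the dodge width can be derived from the bin width instead of being typed by hand:

```r
# Bin width used in geom_histogram(binwidth = ...)
binwidth <- 1e5
# Hypothetical share of each bin to leave empty between groups
gap_fraction <- 0.2
# Width handed to position_dodge(); same value as 1e5 - 20 * 1e3
dodge_width <- binwidth * (1 - gap_fraction)
dodge_width
```

Passing position_dodge(dodge_width) then keeps the gap in step with binwidth if you change it later.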
You could use the following code:
ggplot(data, aes(x = soldPrice, fill = month)) +
geom_histogram(binwidth=1e5, position=position_dodge(1e5-20*(1e3))) + ### <-----
labs(x="Sold Price", y="Sales", fill="") +
scale_x_continuous(labels=scales::comma, breaks=seq(0, 2e6, by = 1e5)) +
theme_bw() +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
yielding this plot:
Alternative 2: use ggplot-object and render with geom_bar
geom_histogram() was not designed to produce what you want. geom_bar() on the other hand provides the flexibility you need.
You can generate the histogram with geom_histogram() and save it in a ggplot object. Then you extract the plotting information with ggplot_build(). Finally, you can use that information to draw a bar plot with geom_bar():
## save ggplot object to h
h <- ggplot(data, aes(x = soldPrice, fill = month)) +
geom_histogram(binwidth=1e5, position=position_dodge(1e5-20*(1e3)))
## get plotting information as data.frame
h_plotdata <- ggplot_build(h)$data[[1]]
h_plotdata$group <- as.factor(h_plotdata$group)
levels(h_plotdata$group) <- c("May 2018", "May 2019")
## plot with geom_bar
ggplot(h_plotdata, aes(x=x, y=y, fill = group)) +
geom_bar(stat = "identity") +
labs(x="Sold Price", y="Sales", fill="") +
scale_x_continuous(labels=scales::comma, breaks=seq(0, 2e6, by = 1e5)) +
theme_bw() +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
yielding this graph:
Please, let me know whether this is what you want.
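If you want to look at what ggplot_build() hands back before re-plotting it, a small sketch on the built-in mtcars data (my substitution, in case the original CSV is not available) could look like this. layer_data(p, 1) is ggplot2's shortcut for ggplot_build(p)$data[[1]].

```r
library(ggplot2)

# Dodged histogram on built-in data; am plays the role of month
p <- ggplot(mtcars, aes(x = mpg, fill = factor(am))) +
  geom_histogram(binwidth = 5, position = position_dodge(4))

# One row per bin and group, with dodged x positions and bin counts
built <- layer_data(p, 1)
head(built[, c("x", "xmin", "xmax", "count", "group")])
```

The group column is what the answer above converts to a factor and relabels before plotting with geom_bar().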

Related

How to change the Y range for ggplot (geom_col) in R?

I am trying to create 2 ggplot bar graphs for text analysis to compare frequencies as percentages from the dictionary "loughran". Here is my code for one of the graphs. How can I edit my y range so that both graphs start at 0% and end at 100%? This way, it would be much easier to see the differences.
ggplot(loughran_nc) +
aes(x = fct_reorder(sentiment, perc), y = perc)+
geom_col()+
ylab("Percentage") +
xlab("Sentiment")+
ggtitle("Sentiment Analysis: Non-Complaints Loughran dictionary")+
theme(plot.title = element_text(hjust = 0.5))
You can set limits within coord_cartesian().
Some quick data:
library(tidyverse)
loughran_nc <- data.frame(sentiment = c("words","for","some","data"),perc=c(40,60,20,80))
Then your plot + 1 line:
ggplot(loughran_nc) +
aes(x = fct_reorder(sentiment, perc), y = perc)+
geom_col()+
ylab("Percentage") +
xlab("Sentiment")+
ggtitle("Sentiment Analysis: Non-Complaints Loughran dictionary")+
theme(plot.title = element_text(hjust = 0.5)) +
coord_cartesian(ylim = c(0,100))
An alternative to coord_cartesian() is to use scale_y_continuous() or ylim().
scale_y_continuous() lets you specify all sorts of attributes to the y axis; limits, breaks, name etc (see ?scale_y_continuous). For your example, you can add scale_y_continuous(limits = c(0, 100)) to your code
ylim() is simple, and adding ylim(c(0, 100)) would also do the same job
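One difference worth knowing, shown here as a small sketch with made-up numbers: limits set through ylim() or scale_y_continuous() treat values outside the range as missing, while coord_cartesian() only zooms the view and keeps the data. With percentages between 0 and 100 both behave the same; the difference only shows up when a value falls outside the limits.

```r
library(ggplot2)

# Made-up data: the second bar exceeds the upper limit of 100
d <- data.frame(x = c("a", "b"), y = c(50, 150))
p <- ggplot(d, aes(x, y)) + geom_col()

zoomed  <- p + coord_cartesian(ylim = c(0, 100)) # zoom only, data kept
limited <- p + ylim(0, 100)                      # out-of-range values become NA

layer_data(zoomed)$y   # both bar heights are still present
layer_data(limited)$y  # compare with the line above
```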

How to properly form ggplot graphs, without cutting off important parts of the graph?

I have created a barchart using ggplot() + geom_bar() functions, by ggplot2 package. I have also used coord_flip() to reverse the orientation of the bars and geom_text() to add the values at the top of each bar. Some of the bars have different colors, so there is a legend following the graph. What I am getting as result is a picture half occupied by the graph, half by the legend and with the values on top of the longest bars being cut off because of the small size of the graph.
Any ideas on how to enlarge the graph and shrink the legend, so that the values on the bars are not cut off?
Thank you
This is my code on imaginary data:
labels <- c("A","B","C","D","E")
freq <- c(10.3678, 5.84554, 1.5673, 2.313, 7.111)
df <- as.data.frame(cbind(labels,freq))
type <- c("rich","poor","poor","poor","rich")
library(ggplot2)
ggplot(df, aes(x = reorder(labels,freq), y= freq, fill = type)) +
geom_bar(stat = "identity", alpha = 1, width = 0.9)+
coord_flip()+
xlab("")+
ylab("Mean frequency")+
scale_fill_manual(name = "Type", values = c("red", "blue")) +
ggtitle("Mean frequency of different labels")+
geom_text(label = sort(freq, decreasing = FALSE), size = 3.5, hjust = -0.2)
And this is the graph it gives as result:
There are a few fixes to this:
Change your Limits
As indicated by @Dave2e; see his response.
Change the size of your output
The interesting thing about graphics in R is that the aspect ratio and resolution of the graphics device will change the result and look of a plot. When I ran your code... no clipping was observed. You can test this out creating the plot and then saving differently. If I take your default code, here's what I get with different arguments to width= and height= for ggsave() as a png:
ggsave('a1.png', width=10, height=5)
ggsave('a2.png', width=15, height=5)
Set an Expansion
The third way is to set an expansion to the scale limits. By default, ggplot2 actually adds some "padding" to the ends of a scale. So, if you set your limits from 0 to 10, you'll actually have a plot area that goes a bit beyond this (about 5% beyond by default). You can redefine that setting by using the expand= argument of scale_... commands in ggplot. So you can set this limit, for example in the following code:
labels <- c("A","B","C","D","E")
freq <- c(10.3678, 5.84554, 1.5673, 2.313, 7.111)
type <- c("rich","poor","poor","poor","rich")
df <- data.frame(labels, freq, type)
library(ggplot2)
ggplot(df, aes(x = reorder(labels,freq), y= freq, fill = type)) +
geom_bar(stat = "identity", alpha = 1, width = 0.9)+
coord_flip()+
xlab("")+
ylab("Mean frequency")+
scale_fill_manual(name = "Type", values = c("red", "blue")) +
ggtitle("Mean frequency of different labels")+
geom_text(label = freq, size = 3.5, hjust = -0.2) +
scale_y_continuous(expand=expansion(mult=c(0,0.15)))
You can define the lower and upper expansion for an axis, so in the above code I've defined to set no expansion to the lower limit of the y scale and to use a multiplier of 0.15 (about 15%) to the upper limit. Default is 0.05, I believe (or 5%).
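To make the multiplier concrete, here is a back-of-the-envelope sketch in base R of the limits that expansion(mult = c(0, 0.15)) would give for this data, assuming the bars span from 0 to max(freq). The variable names are mine, and this only mimics the arithmetic; it is not ggplot2's internal code.

```r
freq <- c(10.3678, 5.84554, 1.5673, 2.313, 7.111)

# Range covered by the bars: they start at 0 and end at the largest value
data_range <- c(0, max(freq))
mult <- c(0, 0.15) # lower and upper multipliers, as passed to expansion()

# Each side is pushed outwards by its multiplier times the width of the range
expanded <- c(data_range[1] - mult[1] * diff(data_range),
              data_range[2] + mult[2] * diff(data_range))
expanded
```

The extra space above the longest bar is what leaves room for its label.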
You can override the default limits on the y axis scale with the ylim() function.
labels <- c("A","B","C","D","E")
freq <- c(10.3678, 5.84554, 1.5673, 2.313, 7.111)
type <- c("rich","poor","poor","poor","rich")
df <- data.frame(labels, freq, type)
#set the max y axis limit to allow enough room for the label
ylimitmax <- 11
library(ggplot2)
ggplot(df, aes(x = reorder(labels,freq), y= freq, fill = type)) +
geom_bar(stat = "identity", alpha = 1, width = 0.9)+
coord_flip()+
xlab("")+
ylab("Mean frequency")+
scale_fill_manual(name = "Type", values = c("red", "blue")) +
ggtitle("Mean frequency of different labels")+
ylim(0, ylimitmax) +
geom_text(label = freq, size = 3.5, hjust = -0.2)
The script shows how to code the manual limits but you may want to automate the limit calculation with something like ylimitmax= max(freq) * 1.2.
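A minimal sketch of that automated version follows. The factor 1.2 is only a starting guess that you would tune to your label size, and I have mapped the labels through aes(label = freq) so they stay attached to the right rows.

```r
library(ggplot2)

labels <- c("A", "B", "C", "D", "E")
freq <- c(10.3678, 5.84554, 1.5673, 2.313, 7.111)
type <- c("rich", "poor", "poor", "poor", "rich")
df <- data.frame(labels, freq, type)

# 20% headroom above the longest bar; the factor 1.2 is a guess to tune
ylimitmax <- max(df$freq) * 1.2

p <- ggplot(df, aes(x = reorder(labels, freq), y = freq, fill = type)) +
  geom_col(width = 0.9) +
  geom_text(aes(label = freq), size = 3.5, hjust = -0.2) +
  coord_flip() +
  ylim(0, ylimitmax)
p
```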

How do I use facetting correctly in ggplot geom_tile, while keeping the aspect ratio intact?

I am trying to create a 'likeliness plot' intended to quickly show an item's likeliness vs other items in a table.
A quick example:
'property_data.csv' file to use:
"","Country","Town","Property","Property_value"
"1","UK","London","Road_quality","Bad"
"2","UK","London","Air_quality","Very bad"
"3","UK","London","House_quality","Average"
"4","UK","London","Library_quality","Good"
"5","UK","London","Pool_quality","Average"
"6","UK","London","Park_quality","Bad"
"7","UK","London","River_quality","Very good"
"8","UK","London","Water_quality","Decent"
"9","UK","London","School_quality","Bad"
"10","UK","Liverpool","Road_quality","Bad"
"11","UK","Liverpool","Air_quality","Very bad"
"12","UK","Liverpool","House_quality","Average"
"13","UK","Liverpool","Library_quality","Good"
"14","UK","Liverpool","Pool_quality","Average"
"15","UK","Liverpool","Park_quality","Bad"
"16","UK","Liverpool","River_quality","Very good"
"17","UK","Liverpool","Water_quality","Decent"
"18","UK","Liverpool","School_quality","Bad"
"19","USA","New York","Road_quality","Bad"
"20","USA","New York","Air_quality","Very bad"
"21","USA","New York","House_quality","Average"
"22","USA","New York","Library_quality","Good"
"23","USA","New York","Pool_quality","Average"
"24","USA","New York","Park_quality","Bad"
"25","USA","New York","River_quality","Very good"
"26","USA","New York","Water_quality","Decent"
"27","USA","New York","School_quality","Bad"
Code:
prop <- read.csv('property_data.csv')
Property_col_vector <- c("NA" = "#e6194b",
"Very bad" = "#e6194B",
"Bad" = "#ffe119",
"Average" = "#bfef45",
"Decent" = "#3cb44b",
"Good" = "#42d4f4",
"Very good" = "#4363d8")
plot_likeliness <- function(town_property_table){
g <- ggplot(town_property_table, aes(Property, Town)) +
geom_tile(aes(fill = Property_value, width=.9, height=.9)) +
theme_classic() +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust=0.5),
strip.text.y = element_text(angle = 0)) +
scale_fill_manual(values = Property_col_vector) +
coord_fixed()
return(g)
}
summary_town_plot <- plot_likeliness(prop)
Output:
This is looking great!
Now I've created a plot that looks nice because I used the coord_fixed() function, but now I want to create the same plot, facetted by Country.
To do this I created the following function:
plot_likeliness_facetted <- function(town_property_table){
g <- ggplot(town_property_table, aes(Property, Town)) +
geom_tile(aes(fill = Property_value, width=.9, height=.9)) +
theme_classic() +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust=0.5),
strip.text.y = element_text(angle = 0)) +
scale_fill_manual(values = Property_col_vector) +
facet_grid(Country ~ .,
scale = 'free_y')
return(g)
}
facetted_town_plot <- plot_likeliness_facetted(prop)
facetted_town_plot
Result:
However, now my tiles are stretched, and if I try to use '+ coord_fixed()' I get the error:
Error: coord_fixed doesn't support free scales
How can I get the plot to facet but maintain the aspect ratio? Please note that I'm plotting these in a series, so hardcoding the heights of the plot with manual values is not a solution I'm after; I need something that dynamically scales with the number of values in the table.
Many thanks for any help!
Edit: Although the same question was asked in slightly different context elsewhere, it had multiple answers with none marked as solving the question.
theme(aspect.ratio = 1) and space = 'free' seem to work.
plot_likeliness_facetted <- function(town_property_table){
g <- ggplot(town_property_table, aes(Property, Town)) +
geom_tile(aes(fill = Property_value, width=.9, height=.9)) +
theme_classic() +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust=0.5),
strip.text.y = element_text(angle = 0), aspect.ratio = 1) +
scale_fill_manual(values = Property_col_vector) +
facet_grid(Country ~ .,
scale = 'free_y', space = 'free')
return(g)
}
This might not be a perfect answer, but I'm going to give it a spin anyway. Basically, it is going to be difficult to do this with base ggplot because, as you mentioned, coord_fixed() or theme(aspect.ratio = ...) don't play nice with facets.
The first solution I'll propose is to use gtables to programmatically set the width of panels to match the number of variables on your x-axis:
plot_likeliness_gtable <- function(town_property_table){
g <- ggplot(town_property_table, aes(Property, Town)) +
geom_tile(aes(fill = Property_value, width=.9, height=.9)) +
theme_classic() +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust=0.5),
strip.text.y = element_text(angle = 0)) +
scale_fill_manual(values = Property_col_vector) +
facet_grid(Country ~ .,
scale = 'free_y', space = "free_y")
# Here be the gtable bits
gt <- ggplotGrob(g)
# Find out where the panel is stored in the x-direction
panel_x <- unique(gt$layout$l[grepl("panel", gt$layout$name)])[1]
# Set that width based on the number of x-axis variables, plus 0.2 because
# of the expand arguments in the scales
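# Count distinct values; unlike droplevels(), this also works when Property
# is a character column (the read.csv() default since R 4.0)
gt$widths[panel_x] <- unit(length(unique(town_property_table$Property)) + 0.2, "null")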
# Respect needs to be true to have 'null' units match in x- and y-direction
gt$respect <- TRUE
return(gt)
}
Which would work in the following way:
library(grid)
x <- plot_likeliness_gtable(prop)
grid.newpage(); grid.draw(x)
And gives this plot:
This all works reasonably well but at this point, it would probably be good to discuss some of the drawbacks of having gtables instead of ggplot objects. First, you can't edit it anymore with ggplot, so you can't add another + geom_myfavouriteshape() or anything of the sort. You could still edit parts of the plot in gtable/grid though. Second, it has the quirky grid.newpage(); grid.draw() syntax, which needs the grid library. Third, we're kind of relying on the ggplot facetting to set the y-direction panel heights correctly (2.2 and 1.2 null-units in your example) while this might not be appropriate in all cases. On the upside, you're still defining dimensions in flexible null-units, so it'll scale pretty well with whatever plotting device you're using.
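To see what the gtable code is manipulating, here is a small sketch that inspects the panel entries of a faceted plot. It uses the built-in mtcars data as a stand-in for the property table, so the column choices are mine and purely illustrative.

```r
library(ggplot2)

# Small faceted tile plot on built-in data, with free panel heights
p <- ggplot(mtcars, aes(factor(cyl), factor(gear))) +
  geom_tile() +
  facet_grid(am ~ ., scales = "free_y", space = "free_y")

gt <- ggplotGrob(p)

# Rows of the layout table that describe the facet panels
is_panel <- grepl("panel", gt$layout$name)
gt$layout[is_panel, c("t", "l", "name")]

# Panel heights and width as ggplot2 set them
gt$heights[unique(gt$layout$t[is_panel])]
gt$widths[unique(gt$layout$l[is_panel])]
```

The l column gives the gtable column holding the panels, which is the entry of gt$widths that the function above overwrites.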
The second solution I'll propose could be a bit hacky for many a taste, but it'll take away the first two drawbacks of using gtables. Some time ago, I had similar issues with the weird panel size behaviour when facetting, so I wrote these functions to set panel sizes (force_panelsizes(), which is available in the ggh4x package). The essence of what it does is to copy the panel drawing function from whatever plot you're making and wrap it inside a new function that sets the panel sizes to some pre-defined numbers. It has to be called after any facetting function though. It would work like this:
plot_likeliness_forcedsizes <- function(town_property_table){
g <- ggplot(town_property_table, aes(Property, Town)) +
geom_tile(aes(fill = Property_value, width=.9, height=.9)) +
theme_classic() +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust=0.5),
strip.text.y = element_text(angle = 0)) +
scale_fill_manual(values = Property_col_vector) +
facet_grid(Country ~ .,
scale = 'free_y', space = "free_y") +
force_panelsizes(cols = length(unique(town_property_table$Property)) + 0.2,
respect = TRUE)
return(g)
}
myplot <- plot_likeliness_forcedsizes(prop)
myplot
It still relies on ggplot setting the y-direction heights correctly though, but you could override these within force_panelsizes() if things go awry.
Hope this helped, good luck!

Axis transformation in ggplot - How do I change the scale on a specific interval?

I've made a violin plot that looks like this:
As we can see most of the data lies near the region where the score is 0.90-0.95. What I wish is to focus on the interval 0.75 to 1.00 by changing the scale giving less space to ratings from 0 to 0.75.
Is there a way to do this?
This is the code I'm currently using to create the violin plot:
ggplot(data=Violin_plots, aes(x = Year, y = Score)) +
geom_violin(aes(fill = Violin_plots$Year), trim = TRUE) +
coord_flip()+
scale_fill_brewer(palette = "Blues") +
theme(legend.position = 'none') +
labs(y = "Rating score",
fill = "Rating year",
title = "Violin-plots of credit rating scores")
While it's possible to transform the scale to focus more on the upper region (e.g. add trans = "exp" as an argument to the scale), a nonlinear scale is often hard to interpret appropriately.
For such use cases, I recommend facet_zoom from the ggforce package, which is pretty much built for this exact purpose (see the package vignette).
I also switched from geom_violin() + coord_flip() to geom_violinh from the ggstance package, which extends ggplot2 by providing flipped versions of ggplot components. Example with simulated data below:
library(ggforce) # for facet_zoom
library(ggstance) # for flipped version of geom_violin
ggplot(df,
aes(x = rating, y = year, fill = year)) +
geom_violinh() + # no need to specify trim = TRUE as it's the default
scale_fill_brewer(palette = "Blues") +
theme(legend.position = 'none') +
facet_zoom(xlim = c(0.75, 0.98)) # specify zoom range here
Sample data that simulates the characteristics of the data in the question:
df <- diamonds[, c("color", "price")]
df$rating <- (max(df$price) - df$price) / max(df$price)
df$year <- df$color
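You could create a second plot that zooms in on the original one, without modifying the data, by setting limits on the coordinate system. A plot can only have one coordinate system, so adding coord_cartesian() after coord_flip() would replace the flip; pass the limits to coord_flip() instead. The limits refer to the y aesthetic (Score), which ends up on the horizontal axis after flipping:
ggplot(data=Violin_plots, aes(x=Year,y=Score)) +
geom_violin(aes(fill=Year),trim=TRUE) +
coord_flip(ylim = c(0.75, 1.00)) +
scale_fill_brewer(palette="Blues") +
theme(legend.position='none') +
labs(y="Rating score",fill="Rating year",title="Violin-plots of credit rating scores")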

How to set automatic label position based on box height

In a previous question, I asked about moving the label position of a barplot outside of the bar if the bar was too small. I was provided this following example:
library(ggplot2)
options(scipen=2)
dataset <- data.frame(Riserva_Riv_Fine_Periodo = 1:10 * 10^6 + 1,
Anno = 1:10)
ggplot(data = dataset,
aes(x = Anno,
y = Riserva_Riv_Fine_Periodo)) +
geom_bar(stat = "identity",
width=0.8,
position="dodge") +
geom_text(aes( y = Riserva_Riv_Fine_Periodo,
label = round(Riserva_Riv_Fine_Periodo, 0),
angle=90,
hjust= ifelse(Riserva_Riv_Fine_Periodo < 3000000, -0.1, 1.2)),
col="red",
size=4,
position = position_dodge(0.9))
And I obtain this graph:
The problem with the example is that the value at which the label is moved must be hard-coded into the plot, and an ifelse statement is used to reposition the label. Is there a way to automatically extract the value to cut?
A slightly better option might be to base the test and the positioning of the labels on the height of the bar relative to the height of the highest bar. That way, the cutoff value and label-shift are scaled to the actual vertical range of the plot. For example:
ydiff = max(dataset$Riserva_Riv_Fine_Periodo)
ggplot(dataset, aes(x = Anno, y = Riserva_Riv_Fine_Periodo)) +
geom_bar(stat = "identity", width=0.8) +
geom_text(aes(label = round(Riserva_Riv_Fine_Periodo, 0), angle=90,
y = ifelse(Riserva_Riv_Fine_Periodo < 0.3*ydiff,
Riserva_Riv_Fine_Periodo + 0.1*ydiff,
Riserva_Riv_Fine_Periodo - 0.1*ydiff)),
col="red", size=4)
You would still need to tweak the fractional cutoff in the test condition (I've used 0.3 in this case), depending on the physical size at which you render the plot. But you could package the code into a function to make any manual adjustments a bit easier.
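One way such a function might look is sketched below. The name label_y and its arguments cutoff and shift are hypothetical; they simply wrap the ifelse() logic from the plot above.

```r
# Hypothetical helper: y position for each label, relative to the tallest bar.
# Bars shorter than cutoff * max get their label above the bar,
# all other bars get it just below the top, inside the bar.
label_y <- function(value, cutoff = 0.3, shift = 0.1) {
  top <- max(value)
  ifelse(value < cutoff * top,
         value + shift * top,
         value - shift * top)
}

label_y(c(1, 5, 10))
```

Inside the plot you would then write aes(y = label_y(Riserva_Riv_Fine_Periodo)) and only touch cutoff and shift when the output size changes.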
It's probably possible to automate this by determining the actual sizes of the various grobs that make up the plot and setting the condition and the positioning based on those sizes, but I'm not sure how to do that.
Just as an editorial comment, a plot with labels inside some bars and above others risks confusing the visual mapping of magnitudes to bar heights. I think it would be better to find a way to shrink, abbreviate, recode, or otherwise tweak the labels so that they contain the information you want to convey while being able to have all the labels inside the bars. Maybe something like this:
library(scales)
ggplot(dataset, aes(x = Anno, y = Riserva_Riv_Fine_Periodo/1000)) +
geom_col(width=0.8, fill="grey30") +
geom_text(aes(label = format(Riserva_Riv_Fine_Periodo/1000, big.mark=",", digits=0),
y = 0.5*Riserva_Riv_Fine_Periodo/1000),
col="white", size=3) +
scale_y_continuous(label=dollar, expand=c(0,1e2)) +
theme_classic() +
labs(y="Riserva (thousands)")
Or maybe go with a line plot instead of bars:
ggplot(dataset, aes(Anno, Riserva_Riv_Fine_Periodo/1e3)) +
geom_line(linetype="11", size=0.3, colour="grey50") +
geom_text(aes(label=format(Riserva_Riv_Fine_Periodo/1e3, big.mark=",", digits=0)),
size=3) +
theme_classic() +
scale_y_continuous(label=dollar, expand=c(0,1e2)) +
expand_limits(y=0) +
labs(y="Riserva (thousands)")

Resources