Remove baseline color for geom_histogram - r

I'm adding a color aesthetic to a faceted histogram. In the reprex below, with no color aesthetic, the histogram only shows data within that facet level. However, with color defined, a baseline is added that stretches to include the range of data across all facets. Is there a way to make this not happen?
I'm looking for something similar to geom_density with trim = TRUE, but there doesn't appear to be a trim option for geom_histogram.
library(tidyverse)
data <- tibble(a = rchisq(1000, df = 3),
b = rchisq(1000, df = 1),
c = rchisq(1000, df = 10)) %>%
gather()
ggplot(data, aes(x = value)) +
geom_histogram() +
facet_wrap(~ key, ncol = 1)
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(data, aes(x = value)) +
geom_histogram(color = "red") +
facet_wrap(~ key, ncol = 1)
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(data, aes(x = value)) +
geom_density(color = "red", trim = TRUE) +
facet_wrap(~ key, ncol = 1)
Created on 2019-07-20 by the reprex package (v0.3.0)

geom_histogram draws its bars using rectGrob from the grid package, and a zero-width / zero-height rectGrob is drawn as a vertical / horizontal line in the outline colour, at least in my RStudio set-up (and the OP's as well, I presume). Demonstration below:
library(grid)
r1 <- rectGrob(width = unit(0, "npc"), gp = gpar(col = "red", fill = "grey")) # zero-width
r2 <- rectGrob(height = unit(0, "npc"), gp = gpar(col = "red", fill = "grey")) # zero-height
grid.draw(r1) # depicted as a vertical line, rather than disappear completely
grid.draw(r2) # depicted as a horizontal line, rather than disappear completely
In this case, if we check the data frame associated with the histogram layer, there are many rows with ymin = ymax = 0, which are responsible for the 'baseline' effect seen in the question.
p <- ggplot(data, aes(x = value)) +
geom_histogram(color = "red") +
facet_wrap(~ key, ncol = 1)
View(layer_data(p) %>% filter(PANEL == 2)) # look at the data associated with facet panel 2
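For a non-interactive check, a minimal sketch along these lines tabulates the empty bins per panel. The data frame is simulated to mimic the question's data, and the fixed bins = 30 and the seed are assumptions made only for illustration:

```r
library(ggplot2)

set.seed(1)
# Simulated stand-in for the question's data, already in long format
data <- data.frame(
  key = rep(c("a", "b", "c"), each = 1000),
  value = c(rchisq(1000, df = 3), rchisq(1000, df = 1), rchisq(1000, df = 10))
)

p <- ggplot(data, aes(x = value)) +
  geom_histogram(color = "red", bins = 30) +
  facet_wrap(~ key, ncol = 1)

ld <- layer_data(p)
# Rows = facet panel, columns = whether the bin is empty (count == 0)
zero_bins <- table(panel = ld$PANEL, empty = ld$count == 0)
print(zero_bins)
```

The panels whose data cover only part of the shared x range should show the most empty bins, and those are the rows drawn as the baseline.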
Workaround: Since the data calculations are done in StatBin's compute_group function, we can define an alternative version of the same function, with an additional step to drop the 0-count rows from the data frame completely:
# StatBin2 inherits from StatBin; its compute_group() is a copy of the original
# with one extra step, just before the result is returned, that drops 0-count bins
StatBin2 <- ggproto(
"StatBin2",
StatBin,
compute_group = function (data, scales, binwidth = NULL, bins = NULL,
center = NULL, boundary = NULL,
closed = c("right", "left"), pad = FALSE,
breaks = NULL, origin = NULL, right = NULL,
drop = NULL, width = NULL) {
if (!is.null(breaks)) {
if (!scales$x$is_discrete()) {
breaks <- scales$x$transform(breaks)
}
bins <- ggplot2:::bin_breaks(breaks, closed)
}
else if (!is.null(binwidth)) {
if (is.function(binwidth)) {
binwidth <- binwidth(data$x)
}
bins <- ggplot2:::bin_breaks_width(scales$x$dimension(), binwidth,
center = center, boundary = boundary,
closed = closed)
}
else {
bins <- ggplot2:::bin_breaks_bins(scales$x$dimension(), bins,
center = center, boundary = boundary,
closed = closed)
}
res <- ggplot2:::bin_vector(data$x, bins, weight = data$weight, pad = pad)
# drop 0-count bins completely before returning the dataframe
res <- res[res$count > 0, ]
res
})
Usage:
ggplot(data, aes(x = value)) +
geom_histogram(color = "red", stat = StatBin2) + # specify stat = StatBin2
facet_wrap(~ key, ncol = 1)

Another option is to use after_stat() in the y aesthetic, with an ifelse() that keeps the count when it is greater than 0 and replaces it with NA otherwise. The NA rows are dropped when the bars are drawn, which removes the baseline color:
library(tidyverse)
ggplot(data, aes(x = value, y = ifelse(after_stat(count) > 0, after_stat(count), NA))) +
geom_histogram(color = "red") +
facet_wrap(~ key, ncol = 1)
Created on 2023-02-15 with reprex v2.0.2
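Because the empty bins now carry a missing y value, ggplot2 reports the removed rows with a warning when the plot is drawn. A small sketch of the same idea with na.rm = TRUE, which should silence that warning; the simulated data, the seed and bins = 30 are assumptions for illustration:

```r
library(ggplot2)

set.seed(1)
# Simulated stand-in for the question's data
data <- data.frame(
  key = rep(c("a", "b", "c"), each = 1000),
  value = c(rchisq(1000, df = 3), rchisq(1000, df = 1), rchisq(1000, df = 10))
)

p <- ggplot(data, aes(x = value,
                      y = ifelse(after_stat(count) > 0, after_stat(count), NA))) +
  geom_histogram(color = "red", bins = 30, na.rm = TRUE) +  # na.rm drops NA rows quietly
  facet_wrap(~ key, ncol = 1) +
  labs(y = "count")  # the default axis title would be the whole ifelse() expression

ld <- layer_data(p)
```

The labs() call is only cosmetic: without it the y axis is titled with the full expression.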

Related

Shade parts of a ggplot based on a (changing) dummy variable

I want to shade areas in a ggplot but I don't want to manually tell geom_rect() where to stop and where to start. My data changes and I always want to shade several areas based on a condition.
Here for example with the condition "negative":
library("ggplot2")
set.seed(3)
plotdata <- data.frame(somevalue = rnorm(10), indicator = 0 , counter = 1:10)
plotdata[plotdata$somevalue < 0,]$indicator <- 1
plotdata
I can do that manually by specifying the ranges myself:
plotranges <- data.frame(from = c(1,4,9), to = c(2,4,9))
ggplot() +
geom_line(data = plotdata, aes(x = counter, y = somevalue)) +
geom_rect(data = plotranges, aes(xmin = from - 1, xmax = to, ymin = -Inf, ymax = Inf), alpha = 0.4)
But my problem is that, so to speak, the set.seed() argument changes and I want to still automatically generate the plot without specifying min and max values of the shading manually. Is there a way (maybe without geom_rect() but instead geom_bar()?) to plot shading based directly on my indicator variable?
edit: Here is my own best attempt, as you can see not great:
ggplot(data = plotdata, aes(x = counter, y = somevalue)) + geom_line() +
geom_bar(aes(y = indicator*max(somevalue)), stat= "identity")
You can use stat_summary() to calculate the extremes of runs of your indicator. In the code below data.table::rleid() groups the data by runs of indicators. In the summary layer, y doesn't really do anything, so we use it to store the resolution of your datapoints, which we then later use to offset the xmin/xmax parameters. The after_stat() function is used to access computed variables after the ranges have been computed.
library("ggplot2")
plotdata <- data.frame(somevalue = rnorm(10), counter = 1:10)
plotdata$indicator <- as.numeric(plotdata$somevalue < 0)
ggplot(plotdata, aes(counter, somevalue)) +
stat_summary(
geom = "rect",
aes(group = data.table::rleid(indicator),
xmin = after_stat(xmin - y * 0.5),
xmax = after_stat(xmax + y * 0.5),
alpha = I((indicator) * 0.4),
y = resolution(counter)),
fun.min = min, fun.max = max,
orientation = "y", ymin = -Inf, ymax = Inf
) +
geom_line()
Created on 2021-09-14 by the reprex package (v2.0.1)
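If you would rather keep geom_rect(), the ranges can also be derived from the indicator with base R's rle(). This is only a sketch of that alternative, not the stat_summary() approach above; the names runs, starts, ends and keep are made up for the example, and the half-step padding on each side is an assumption that mirrors the resolution offset used above:

```r
library(ggplot2)

set.seed(3)
plotdata <- data.frame(somevalue = rnorm(10), counter = 1:10)
plotdata$indicator <- as.numeric(plotdata$somevalue < 0)

# Run-length encode the indicator, then convert run lengths to start/end positions
runs   <- rle(plotdata$indicator)
ends   <- cumsum(runs$lengths)
starts <- ends - runs$lengths + 1
keep   <- runs$values == 1          # only runs where the condition holds

plotranges <- data.frame(
  from = plotdata$counter[starts[keep]],
  to   = plotdata$counter[ends[keep]]
)

p <- ggplot() +
  geom_rect(data = plotranges,
            aes(xmin = from - 0.5, xmax = to + 0.5, ymin = -Inf, ymax = Inf),
            alpha = 0.4) +
  geom_line(data = plotdata, aes(x = counter, y = somevalue))
```

Since plotranges is recomputed from the data, changing the seed or the condition needs no manual edits.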

ggpubr ggbarplot plotted with bizarre xvalues

I have the following code using the CSV below
library(ggpubr)
library(ggsci)
df = read.csv2("file.csv", row.names=1)
# Copy df
df2 = df
# Convert the perc variable to a factor
df2$perc <- as.factor(df2$perc)
# Add the name column
df2$name <- rownames(df)
ggbarplot(df2, x = "name", y = "perc",
fill = "role", # change fill color by role
color = "white", # Set bar border colors to white
palette = "npg", # npg journal color palette. see ?ggpar
sort.val = "asc", # Sort the values in ascending order
sort.by.groups = FALSE, # Don't sort inside each group
x.text.angle = 0, # Keep x axis texts horizontal
rotate = TRUE,
label = TRUE, label.pos = "out",
#label = TRUE, lab.pos = "in", lab.col = "white",
width = 0.5
)
the CSV is :
genes;perc;role
GATA-3;7,9;confirmed in this cancer
CCDC74A;6,8;prognostic in this cancer
LINC00621;6,1;none
POLRMTP1;4,1;none
IGF2BP3;3,2;confirmed in this cancer
which produced this plot
There are two things I don't get here:
1) Why does the axis tick of each bar correspond to the actual value plotted? I expected the value axis to run continuously from 0 to 8. I hope I am explaining this correctly.
2) The label values seem misaligned with the axis ticks. Am I missing an option here?
To be honest, I would probably not use ggpubr here. Staying in the ggplot syntax is often safer. And also arguably less code...
(Also, don't use factors in this case, as user teunbrand commented)
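To see why the factor conversion matters, here is a small sketch using the perc values from the CSV. A factor stores one level per distinct value, so the axis gets one discrete tick per value instead of a continuous 0 to 8 range:

```r
perc <- c(7.9, 6.8, 6.1, 4.1, 3.2)

f <- as.factor(perc)
lv <- levels(f)         # the distinct values, stored as text labels
codes <- as.integer(f)  # the integer codes that are actually plotted

print(lv)
print(codes)

# Going back to numbers has to pass through the labels, not the codes
back <- as.numeric(as.character(f))
```

Keeping perc numeric avoids all of this and lets the value axis be continuous.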
Two good options for horizontal bars:
library(tidyverse)
library(ggstance)
library(ggsci)
Option 1 - use coord_flip
ggplot(df2, aes(fct_reorder(genes, perc), perc, fill = role)) +
geom_col() +
geom_text(aes(label = perc), hjust = 0) +
scale_fill_npg() +
coord_flip(ylim = c(0,100)) +
theme_classic() +
theme(legend.position = 'top') +
labs(x = 'gene', y = 'percent')
Option 2 - use the ggstance package
I prefer option 2, because using ggstance allows for more flexible combination with other plots.
ggplot(df2, aes(perc, fct_reorder(genes, perc), fill = role)) +
geom_colh() +
geom_text(aes(label = perc), hjust = 0)+
scale_fill_npg() +
coord_cartesian(xlim = c(0,100)) +
theme_classic() +
theme(legend.position = 'top')+
labs(x = 'percent', y = 'gene')
Created on 2020-03-27 by the reprex package (v0.3.0)
data
df2 <- read_delim("genes;perc;role
GATA-3;7,9;confirmed in this cancer
CCDC74A;6,8;prognostic in this cancer
LINC00621;6,1;none
POLRMTP1;4,1;none
IGF2BP3;3,2;confirmed in this cancer", ";") %>% rownames_to_column("name")
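One caveat when reading this data with readr: the default locale treats "," as a grouping mark rather than a decimal mark, so "7,9" is not parsed as 7.9 unless you say so. A minimal sketch of the difference using parse_number(); passing the same locale to read_delim() should have the equivalent effect:

```r
library(readr)

# Default locale: "," is a grouping mark, so the comma is simply dropped
default_value <- parse_number("7,9")

# Locale with a decimal comma, as used in the CSV from the question
comma_locale <- locale(decimal_mark = ",")
comma_value  <- parse_number("7,9", locale = comma_locale)

print(c(default = default_value, decimal_comma = comma_value))
```

read.csv2(), used in the question, already assumes a decimal comma, so this only matters for the readr version of the import.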

R: placing a text with combination of variables over bars in ggplot

Let's draw a bar chart with ggplot2 from the following data (already in long format). The values of the variable are then placed in the middle of the bars with geom_text().
stuff.dat<-read.csv(text="continent,stuff,num
America,apples,13
America,bananas,13
Europe,apples,30
Europe,bananas,21
total,apples,43
total,bananas,34")
library(ggplot2)
ggplot(stuff.dat, aes(x=continent, y=num,fill=stuff))+geom_col() +
geom_text(position = position_stack(vjust=0.5),
aes(label=num))
Now I need to add the "Apple-Bananas Index", defined as f = apples/bananas, on top of the bars, just as was added manually in the figure. How can this be programmed in ggplot? And how could it be added to the legend as a separate entry?
I think that the easiest way to achieve this is to prepare the data before you create the plot. I define a function abi() that computes the apple-banana-index from stuff.dat given a continent:
abi <- function(cont) {
with(stuff.dat,
num[continent == cont & stuff == "apples"] / num[continent == cont & stuff == "bananas"]
)
}
And then I create a data frame with all the necessary data:
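# read.csv() returns character columns by default since R 4.0, so levels() would be NULL;
# sort(unique()) also matches the row order that aggregate() produces
conts <- sort(unique(as.character(stuff.dat$continent)))
abi_df <- data.frame(continent = conts,
yf = aggregate(num ~ continent, sum, data = stuff.dat)$num + 5,
abi = round(sapply(conts, abi), 1))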
Now, I can add that information to the plot:
library(ggplot2)
ggplot(stuff.dat, aes(x = continent, y = num, fill = stuff)) +
geom_col() +
geom_text(position = position_stack(vjust = 0.5), aes(label = num)) +
geom_text(data = abi_df, aes(y = yf, label = paste0("f = ", abi), fill = NA))
Adding fill = NA to the geom_text() is a bit of a hack and leads to a warning. But if fill is not set, plotting will fail with a message that stuff was not found. I also tried to move fill = stuff from ggplot() to geom_col() but this breaks the y-coordinate of the text labels inside the bars. There might be a cleaner solution to this, but I haven't found it yet.
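One possibly cleaner variant, offered as a sketch rather than a definitive fix: map fill only in geom_col() and give the inner labels an explicit group = stuff, so that position_stack() stacks them in the same order as the bars. The index is computed here with merge() instead of abi(); the column names num.a and num.b come from the suffixes chosen for this example:

```r
library(ggplot2)

stuff.dat <- read.csv(text = "continent,stuff,num
America,apples,13
America,bananas,13
Europe,apples,30
Europe,bananas,21
total,apples,43
total,bananas,34")

# One row per continent with the apple and banana counts side by side
m <- merge(stuff.dat[stuff.dat$stuff == "apples", ],
           stuff.dat[stuff.dat$stuff == "bananas", ],
           by = "continent", suffixes = c(".a", ".b"))
abi_df <- data.frame(continent = m$continent,
                     yf  = m$num.a + m$num.b + 5,       # label height above the bar
                     abi = round(m$num.a / m$num.b, 1))

p <- ggplot(stuff.dat, aes(x = continent, y = num)) +
  geom_col(aes(fill = stuff)) +
  geom_text(aes(label = num, group = stuff),
            position = position_stack(vjust = 0.5)) +
  geom_text(data = abi_df, aes(y = yf, label = paste0("f = ", abi)))
```

With fill no longer a plot-wide aesthetic, the second geom_text() does not look for stuff in abi_df, so the fill = NA workaround is not needed.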
Adding the additional legend is, unfortunately, not trivial, because one cannot easily add text outside the plot area. This actually needs two steps: first one adds text using annotation_custom(). Then, you need to turn clipping off to make the text visible (see, e.g., here). This is a possible solution:
library(grid) # textGrob() and gpar() come from grid, so load it before building the plot
p <- ggplot(stuff.dat, aes(x = continent, y = num, fill = stuff)) +
geom_col() +
geom_text(position = position_stack(vjust = 0.5), aes(label = num)) +
geom_text(data = abi_df, aes(y = yf, label = paste0("f = ", abi), fill = NA)) +
guides(size = guide_legend(title = "f: ABI", override.aes = list(fill = 1))) +
annotation_custom(grob = textGrob("f: ABI\n(Apple-\nBanana-\nIndex)",
gp = gpar(cex = .8), just = "left"),
xmin = 3.8, xmax = 3.8, ymin = 17, ymax = 17)
# turn off clipping so that text outside the panel is drawn
gt <- ggplot_gtable(ggplot_build(p))
gt$layout$clip[gt$layout$name == "panel"] <- "off"
grid.draw(gt)

ggplot2: dealing with extremes values by setting a continuous color scale

I am trying to plot some global maps (raster files) and I have some problems in setting up a good color scale for my data. What I would like to do is to plot my data using a divergent palette (e.g. cm.colors), and I would like to center the color "white" of such scale with the value zero, but without having to set symmetric values in the scale (i.e. the same value both negative and positive, i.e. limits=c(-1,1)). Additionally, I would like to plot all values above and/or below a certain value all with the same color.
In other words, if we suppose that my map has a range of -100 to 150, I would like to plot my map with a diverging palette with a "white" color corresponding to the value 0, and having all values e.g. below -20 and above 50 plotted with the same color, i.e. respectively with the negative and positive extremes of the color palette.
Here is an example of the code that I am using for the moment:
ggplot(df, aes(y=Latitude, x=Longitude)) +
geom_raster(aes(fill=MAP)) +
coord_equal()+
theme_gray() +
theme(panel.background = element_rect(fill = 'skyblue2', colour = 'black'),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
legend.position = "right",
legend.key = element_blank()) +
scale_fill_gradientn("MAP", limits=c(-0.5,1), colours=cm.colors(20))
There are simple ways to accomplish this, such as truncating your data beforehand, or using cut to create discrete bins for appropriate labels.
require(dplyr)
df %>% mutate(z2 = ifelse(z > 50, 50, ifelse(z < -20, -20, z))) %>%
ggplot(aes(x, y, fill = z2)) + geom_tile() +
scale_fill_gradient2(low = cm.colors(20)[1], high = cm.colors(20)[20])
df %>% mutate(z2 = cut(z, c(-Inf, seq(-20, 50, by = 10), Inf)),
z3 = as.numeric(z2)-3) %>%
{ggplot(., aes(x, y, fill = z3)) + geom_tile() +
scale_fill_gradient2(low = cm.colors(20)[1], high = cm.colors(20)[20],
breaks = unique(.$z3), labels = unique(.$z2))}
But I'd thought about this task before, and felt unsatisfied with that. The pre-truncating doesn't leave nice labels, and the cut option is always fiddly (particularly having to adjust the parameters of seq inside cut and figure out how to recenter the bins). So I tried to define a reusable transformation that would do the truncating and relabeling for you.
I haven't fully debugged this and I'm going out of town, so hopefully you or another answerer can take a crack at it. The main problem seems to be collisions in the edge cases, so occasionally the limits overlap the intended breaks visually, as well as some unexpected behavior with the formatting. I just used some dummy data to create your desired range of -100 to 150 to test it.
require(scales)
trim_tails <- function(range = c(-Inf, Inf)) trans_new("trim_tails",
transform = function(x) {
force(range)
desired_breaks <- extended_breaks(n = 7)(x[x >= range[1] & x <= range[2]])
break_increment <- diff(desired_breaks)[1]
x[x < range[1]] <- range[1] - break_increment
x[x > range[2]] <- range[2] + break_increment
x
},
inverse = function(x) x,
breaks = function(x) {
force(range)
extended_breaks(n = 7)(x)
},
format = function(x) {
force(range)
x[1] <- paste("<", range[1])
x[length(x)] <- paste(">", range[2])
x
})
ggplot(df, aes(x, y, fill = z)) + geom_tile() +
guides(fill = guide_colorbar(label.hjust = 1)) +
scale_fill_gradient2(low = cm.colors(20)[1], high = cm.colors(20)[20],
trans = trim_tails(range = c(-20,50)))
Also works with a boxed legend instead of a colorbar, just use ... + guides(fill = guide_legend(label.hjust = 1, reverse = T)) + ...
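Another route that may be worth trying, sketched here under assumptions rather than as a replacement for the transformation above: let the scale do the truncation with oob = scales::squish, and place white at zero by passing rescaled anchor points to scale_fill_gradientn(). The dummy data frame and the three-colour palette are made up for the example:

```r
library(ggplot2)
library(scales)

set.seed(42)
# Dummy data covering the range -100 to 150 mentioned in the question
df <- expand.grid(x = 1:20, y = 1:20)
df$z <- runif(nrow(df), min = -100, max = 150)

lims <- c(-20, 50)
# Positions of the three colours on the 0-1 scale: low limit, zero, high limit
anchors <- rescale(c(lims[1], 0, lims[2]))

p <- ggplot(df, aes(x, y, fill = z)) +
  geom_tile() +
  scale_fill_gradientn(
    colours = c(cm.colors(20)[1], "white", cm.colors(20)[20]),
    values  = anchors,
    limits  = lims,
    oob     = squish   # values outside the limits take the colour of the nearest limit
  )

# What squish does to out-of-range values
squished <- squish(c(-100, 0, 150), range = lims)
```

This does not relabel the ends of the legend as "< -20" and "> 50"; those labels would still have to be set by hand through breaks and labels.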

Most succinct way to label/annotate extreme values with ggplot?

I'd like to annotate all y-values greater than a y-threshold using ggplot2.
When you plot(lm(y~x)), using the base package, the second graph that pops up automatically is Residuals vs Fitted, the third is qqplot, and the fourth is Scale-location. Each of these automatically label your extreme Y values by listing their corresponding X value as an adjacent annotation. I'm looking for something like this.
What's the best way to achieve this base-default behavior using ggplot2?
Updated to use scale_size_area() in place of scale_area().
You might be able to take something from this to suit your needs.
library(ggplot2)
#Some data
df <- data.frame(x = round(runif(100), 2), y = round(runif(100), 2))
m1 <- lm(y ~ x, data = df)
df.fortified = fortify(m1)
names(df.fortified) # Names for the variables containing residuals and derived quantities
# Select extreme values
df.fortified$extreme = ifelse(abs(df.fortified$`.stdresid`) > 1.5, 1, 0)
# Based on examples on page 173 in Wickham's ggplot2 book
plot = ggplot(data = df.fortified, aes(x = x, y = .stdresid)) +
geom_point() +
geom_text(data = df.fortified[df.fortified$extreme == 1, ],
aes(label = x, x = x, y = .stdresid), size = 3, hjust = -.3)
plot
plot1 = ggplot(data = df.fortified, aes(x = .fitted, y = .resid)) +
geom_point() + geom_smooth(se = FALSE)
plot2 = ggplot(data = df.fortified, aes(x = .fitted, y = .resid, size = .cooksd)) +
geom_point() + scale_size_area("Cook's distance") + geom_smooth(se = FALSE, show.legend = FALSE)
library(gridExtra)
grid.arrange(plot1, plot2)
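A somewhat more compact variant of the labelling step, as a sketch: a layer's data argument can be a function of the plot data, so the filter for extreme points can live inside geom_text(). Here the standardized residuals come from base R's rstandard() rather than fortify(); the 1.5 threshold follows the example above, and the column names in res are made up for this sketch:

```r
library(ggplot2)

set.seed(1)
df <- data.frame(x = round(runif(100), 2), y = round(runif(100), 2))
m1 <- lm(y ~ x, data = df)

# Collect what the plot needs from the fitted model
res <- data.frame(x = df$x, fitted = fitted(m1), stdresid = rstandard(m1))

threshold <- 1.5
p <- ggplot(res, aes(x = fitted, y = stdresid)) +
  geom_point() +
  # Label only the points whose standardized residual exceeds the threshold
  geom_text(data = function(d) d[abs(d$stdresid) > threshold, ],
            aes(label = x), size = 3, hjust = -0.3)

labelled <- layer_data(p, 2)  # rows that actually receive a label
```

Changing threshold is then the only edit needed to label more or fewer points.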

Resources