Is there a way to include character values on the axes when plotting continuous data with ggplot2? I have censored data such as:
x y Freq
1 -3 16 3
2 -2 12 4
3 0 10 6
4 2 7 7
5 2 4 3
The last row of data is right censored. I am plotting this with the code below:
a1 = data.frame(x=c(-3,-2,0,2,2), y=c(16,12,10,7,4), Freq=c(3,4,6,7,3))
fit = ggplot(a1, aes(x,y)) + geom_text(aes(label=Freq), size=5)+
theme_bw() +
scale_x_continuous(breaks = seq(min(a1$x)-1,max(a1$x)+1,by=1),
labels = seq(min(a1$x)-1,max(a1$x)+1,by=1),
limits = c(min(a1$x)-1,max(a1$x)+1))+
scale_y_continuous(breaks = seq(min(a1$y),max(a1$y),by=2))
The 3 points at (2,4) are right censored. I would like them to be plotted one unit to the right, with the corresponding x-axis tick mark labelled '>=2' instead of 3. Any ideas if this is possible?
It is quite possible. I hacked the data so that the point at (2,4) becomes (3,4). Then I modified your labels, which can be whatever you want as long as there are as many labels as breaks.
a1$x[5] = 3 # the data hack: move the censored point one unit to the right
ggplot(a1, aes(x,y)) + geom_text(aes(label=Freq), size=5)+
theme_bw() +
scale_x_continuous(breaks = seq(min(a1$x)-1,max(a1$x),by=1),
labels = c(seq(min(a1$x)-1,max(a1$x)-1,by=1), ">=2"),
limits = c(min(a1$x)-1,max(a1$x)))+
scale_y_continuous(breaks = seq(min(a1$y),max(a1$y),by=2))
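A minimal sketch of the same idea with the shift made explicit, so the original x values stay untouched. The censored flag vector and the x_plot column are names made up for illustration; they are not part of the question's data.

```r
library(ggplot2)

a1 <- data.frame(x = c(-3, -2, 0, 2, 2),
                 y = c(16, 12, 10, 7, 4),
                 Freq = c(3, 4, 6, 7, 3))

# Assumption: a logical flag marking the right-censored rows (here, the last one).
censored <- c(FALSE, FALSE, FALSE, FALSE, TRUE)

# Plot censored points one unit to the right of their recorded value.
a1$x_plot <- ifelse(censored, a1$x + 1, a1$x)

# One break per integer; relabel the break that carries the censored points.
brks <- seq(min(a1$x_plot) - 1, max(a1$x_plot), by = 1)
labs <- as.character(brks)
labs[brks == max(a1$x_plot)] <- ">=2"

p <- ggplot(a1, aes(x_plot, y)) +
  geom_text(aes(label = Freq), size = 5) +
  theme_bw() +
  scale_x_continuous(name = "x", breaks = brks, labels = labs,
                     limits = range(brks)) +
  scale_y_continuous(breaks = seq(min(a1$y), max(a1$y), by = 2))
print(p)
```

Keeping the shifted values in a separate column means the recorded x is still available for any modelling done alongside the plot.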
The Ask:
Please help me understand my conceptual error in the use of scale_x_binned() in ggplot2 as it relates to centering breaks beneath the appropriate bin in a geom_histogram().
Starting Example:
library(ggplot2)
df <- data.frame(hour = sample(seq(0,23), 150, replace = TRUE))
# The data is just the integer values of the 24-hour clock in a day. It is
# **NOT** continuous data.
ggplot(df, aes(x = hour)) +
geom_histogram(bins = 24, fill = "grey60", color = "red")
This produces a histogram with each label properly centered beneath the
bin to which it belongs, but I want to label every hour, 0 - 23.
To do that, I thought I would assign breaks using scale_x_binned():
ggplot(df, aes(x = hour)) +
geom_histogram(bins = 24, fill = "grey60", color = "red") +
scale_x_binned(name = "Hour of Day",
breaks = seq(0,23))
#> Warning: Removed 1 rows containing missing values (`geom_bar()`).
This returns the number of labels I wanted, but they are not centered
beneath the bins as desired. I also get the warning message for missing
values associated with geom_bar().
I believe I am overwriting the bins = 24 from the geom_histogram() call when I use the scale_x_binned() call afterward, but I don't understand exactly what is causing geom_histogram() to be centered in the first case that I am wrecking with my new call. I'd really like to have that clarified as I am not seeing my error when I read the associated help pages.
EDIT:
The "Starting Example" essentially works (bins are centered) except for the number of labels I ultimately want. If you built the ggplot2 layer differently, what is the equivalent code? That is, instead of:
ggplot(df, aes(x = hour)) +
geom_histogram(bins = 24, fill = "grey60", color = "red")
the call was instead built something like:
ggplot(df, aes(x = hour)) +
geom_histogram(fill = "grey60", color = "red") +
scale_x_binned(n.breaks = 24) # I know this isn't right, but akin to this.
or maybe
ggplot(df, aes(x = hour)) +
stat_bin(bins = 24, center = 0, fill = "grey60", color = "red")
It sounds like you are looking to use non-default labeling, where you want the labels to be aligned to the midpoint of the bins instead of their boundaries, which is what the breaks define. We could do that by using a continuous scale and hiding the main breaks, but keeping the minor breaks, like below.
scale_x_binned does not have minor breaks. It only has breaks at the boundaries of the bins, so it's not obvious to me how you could place the break labels at the midpoints of the bins.
ggplot(df, aes(x = hour)) +
geom_histogram(bins = 24, fill = "grey60", color = "red") +
scale_x_continuous(name = "Hour of Day", breaks = 0:23) +
theme(axis.ticks = element_blank(),
panel.grid.major.x = element_blank())
I thought the same as you, namely scale_x_discrete, but the data given to geom_histogram is assumed to be continuous, so ...
ggplot(df, aes(x = hour)) +
geom_histogram(bins = 24, fill = "grey60", color = "red") +
scale_x_continuous(breaks = 0:23)
(Doesn't require any machinations with theme.)
I wish I could tell you that I found out how geom_histogram is centering the labels, but ggproto objects exist in a cavern with too many tunnels and passages for my mind to follow.
So I took a shot at examining the plot object (stored as plt) that I created when I produced the graphic above:
ggplot_build(plt)
# ------------
$data
$data[[1]]
y count x xmin xmax density ncount ndensity flipped_aes PANEL group ymin ymax colour fill size linetype
1 6 6 0 -0.5 0.5 0.04000000 0.6 0.6 FALSE 1 -1 0 6 red grey60 0.5 1
2 7 7 1 0.5 1.5 0.04666667 0.7 0.7 FALSE 1 -1 0 7 red grey60 0.5 1
3 4 4 2 1.5 2.5 0.02666667 0.4 0.4 FALSE 1 -1 0 4 red grey60 0.5 1
4 5 5 3 2.5 3.5 0.03333333 0.5 0.5 FALSE 1 -1 0 5 red grey60 0.5 1
5 7 7 4 3.5 4.5 0.04666667 0.7 0.7 FALSE 1 -1 0 7 red grey60 0.5 1
#snipped remainder
So the reason the break tick-marks are centered is that the bin construction is set up so they all are centered on the breaks.
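To make that centering explicit rather than relying on how bins = 24 gets translated into a bin width, one option is to fix the bin width and center yourself. This is only a sketch, assuming integer hours from 0 to 23; binwidth and center are documented arguments of geom_histogram(), and set.seed(1) is there just to make the example reproducible.

```r
library(ggplot2)

set.seed(1)
df <- data.frame(hour = sample(0:23, 150, replace = TRUE))

# Bins of width 1 centered on the integers, with one labelled break per hour.
p <- ggplot(df, aes(x = hour)) +
  geom_histogram(binwidth = 1, center = 0, fill = "grey60", color = "red") +
  scale_x_continuous(name = "Hour of Day", breaks = 0:23)

# Inspect the computed bins: x is the bin center, xmin/xmax its edges.
bins <- ggplot_build(p)$data[[1]]
head(bins[, c("x", "xmin", "xmax", "count")])
```

Each bin then runs from half a unit below to half a unit above an integer, so breaks placed at 0:23 fall on the bin centers.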
Further exploration of what's in the ggplot_build results:
ls(envir=ggplot_build(plt)$layout)
#[1] "coord" "coord_params" "facet" "facet_params" "layout" "panel_params"
#[7] "panel_scales_x" "panel_scales_y" "super"
ggplot_build(plt)$layout$panel_params
#-------results
[[1]]
[[1]]$x
<ggproto object: Class ViewScale, gg>
aesthetics: x xmin xmax xend xintercept xmin_final xmax_final xlower ...
break_positions: function
break_positions_minor: function
breaks: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 ...
continuous_range: -1.7 24.7
dimension: function
get_breaks: function
get_breaks_minor: function
#---- snipped remaining output
I have a histogram displaying the area distribution of 12,000 observations with area on the x-axis and frequency of a given area on the y-axis.
I want to add a vertical line which sits at the area of a specific observation on the x-axis.
I have been trying various combinations of the geom_vline function without success:
ggplot(a_b, aes(x=a_area_km)) +
geom_histogram(fill= 'light blue', bins = 20) +
geom_vline(aes(xintercept= a_b$identifier == '1540')) +
scale_x_continuous(trans = "log10")
Any help would be appreciated
ID  AREA
1   3.8493732
2   1.9130095
3   2.3303074
4   0.8634214
5   0.5458977
6   1.5271307
7   12.4303822
8   0.6196505
9   2.0999631
10  0.2086267
11  0.6889139
12  1.0927132
13  10.9666451
14  4.6828732
15  0.2302338
If you want a vertical line at the value in row 1540 of a_area_km, then you should use xintercept = a_area_km[1540]:
library(ggplot2)
ggplot(a_b, aes(x=a_area_km)) +
geom_histogram(fill= 'light blue', bins = 20) +
geom_vline(aes(xintercept= a_area_km[1540])) +
scale_x_continuous(trans = "log10")
Data
set.seed(1)
a_b <- data.frame(a_area_km = rgamma(12000, 3, .5))
a_b$a_area_km[1540]
[1] 5.573579
We can use match() to look up the row whose identifier is 1540:
ggplot(a_b, aes(x=a_area_km)) +
geom_histogram(fill= 'light blue', bins = 20) +
geom_vline(aes(xintercept= a_area_km[match(1540, identifier)])) +
scale_x_continuous(trans = "log10")
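Another way to write this, sketched on simulated data: compute the intercept once, outside aes(), and pass it to geom_vline() as a constant. The identifier column here is an assumption (simply the row number), since the question does not show how it is stored, and target_area is a name introduced for the example.

```r
library(ggplot2)

set.seed(1)
# Assumption: identifier is a plain column of IDs; here it is just 1..12000.
a_b <- data.frame(identifier = 1:12000,
                  a_area_km = rgamma(12000, 3, .5))

# Look up the area of the observation whose identifier is 1540.
target_area <- a_b$a_area_km[match(1540, a_b$identifier)]

p <- ggplot(a_b, aes(x = a_area_km)) +
  geom_histogram(fill = "light blue", bins = 20) +
  geom_vline(xintercept = target_area, color = "red") +
  scale_x_log10()
print(p)
```

Passing a single number avoids evaluating the lookup inside the layer's aesthetics, and makes it easy to check the value before plotting.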
ggplot(df) +
geom_bar(aes(x=Date, y=DCMTotalCV, fill=CampaignName), stat='identity', position='stack') +
geom_line(aes(x=Date, y=DCMCPA, color=CampaignName, group=as.factor(CampaignName)), na.rm = FALSE,show.legend=NA)+
scale_y_continuous(sec.axis = sec_axis(~./1000, name = "DCMTotalCV"))+
theme_bw()+
labs(
x= "Date",
y= "CPA",
title = "Daily Performance"
)
Hey everyone - so I have 2 y-axes I want to plot. geom_line is registering fine on the main y-axis but geom_bar is not registering properly on the right. I tried scaling but it's still not registering or plotting on that second axis. It looks like it's still appearing on the main y-axis, so I'm wondering how to tell the plot to plot it on the second one? Sorry, I'm kind of a newbie. Thanks!
data <- data.frame(
day = as.Date("2020-01-01") + 0:5, # six consecutive days, one per row
conversions = seq(1,6)^2,
cpa = 100000 / seq(1,6)^2
)
head(data)
str(data)
#plot
ggplot(data, aes(x=day)) +
geom_bar( aes(y=conversions), stat='identity') +
geom_line( aes(y=cpa)) +
scale_y_continuous(sec.axis = sec_axis(~./1000))
ggplot2::sec_axis is intended only to put up the scale itself; it does nothing to try to scale the values (that you are pairing with that axis). Why? Primarily because it knows nothing about which y variable you are intending to pair with which y-axis. (Is there anywhere in sec_axis to tell it that it should be looking at a particular variable? Nope.)
As a demonstration, let's start with some random data and plot the line.
set.seed(42)
dat <- data.frame(x = rep(1:10), y1 = sample(10), y2 = sample(100, size = 10))
dat
# x y1 y2
# 1 1 1 47
# 2 2 5 24
# 3 3 10 71
# 4 4 8 89
# 5 5 2 37
# 6 6 4 20
# 7 7 6 26
# 8 8 9 3
# 9 9 7 41
# 10 10 3 97
ggplot(dat, aes(x, y1)) +
geom_line() +
scale_y_continuous(name = "Oops!")
Now you determine that you want to add the y2 variable in there, but because its values are on a completely different scale, you think to just add them (I'll use geom_text here) and then set a second axis.
ggplot(dat, aes(x, y1)) +
geom_line() +
geom_text(aes(y = y2, label = y2)) +
scale_y_continuous(name = "Oops!", sec.axis = sec_axis(~ . * 10, name = "Quux!"))
Two things wrong with this:
The primary (left) y-axis now scales from 0 to 100, scrunching the primary y values to the bottom of the plot; and
Related, the secondary (right) y-axis scales from 0 to 1000?!? This is because the only thing that the secondary axis "knows" is the values that go into the primary axis ... and the primary axis is scaling to fit all of the y* variables it is told to plot.
That last point is important: this is giving y values that scale from 0 to 100, so the axis will reflect that. You can do lims(y=c(0,10)), but realize you'll be truncating y2 values ... that's not the right approach.
Instead, you need to scale the second values to be within the same range of values as the primary axis variable y1. Though not required, I'll use scales::rescale for this.
dat$y2scaled <- scales::rescale(dat$y2, range(dat$y1))
dat
# x y1 y2 y2scaled
# 1 1 1 47 5.212766
# 2 2 5 24 3.010638
# 3 3 10 71 7.510638
# 4 4 8 89 9.234043
# 5 5 2 37 4.255319
# 6 6 4 20 2.627660
# 7 7 6 26 3.202128
# 8 8 9 3 1.000000
# 9 9 7 41 4.638298
# 10 10 3 97 10.000000
Notice how y2scaled is now proportionately within y1's range?
We'll use that to position each of the text objects (though we'll still show the y2 as the label here).
ggplot(dat, aes(x, y1)) +
geom_line() +
geom_text(aes(y = y2scaled, label = y2)) +
scale_y_continuous(name = "Oops!", sec.axis = sec_axis(~ . * 10, name = "Quux!"))
Are we strictly required to make sure that the points pairing with the secondary axis perfectly fill the range of values of the primary axis? No. We could easily have thought to keep the text labels only on the bottom half of the plot, so we'd have to scale appropriately.
dat$y2scaled2 <- scales::rescale(dat$y2, range(dat$y1) / c(1, 2))
dat
# x y1 y2 y2scaled y2scaled2
# 1 1 1 47 5.212766 2.872340
# 2 2 5 24 3.010638 1.893617
# 3 3 10 71 7.510638 3.893617
# 4 4 8 89 9.234043 4.659574
# 5 5 2 37 4.255319 2.446809
# 6 6 4 20 2.627660 1.723404
# 7 7 6 26 3.202128 1.978723
# 8 8 9 3 1.000000 1.000000
# 9 9 7 41 4.638298 2.617021
# 10 10 3 97 10.000000 5.000000
ggplot(dat, aes(x, y1)) +
geom_line() +
geom_text(aes(y = y2scaled2, label = y2)) +
scale_y_continuous(name = "Oops!", sec.axis = sec_axis(~ . * 20, name = "Quux!"))
Notice that not only did I change how the y-axis values were scaled (now ranging from 1 to 5 in y2scaled2), but I also had to change the transformation within sec_axis to be *20 instead of *10.
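One way to keep the two transformations in sync is to derive both from the same pair of coefficients. The sketch below is not the code used above: it writes the rescaling as a linear map a + b * y2 and hands sec_axis() the exact inverse, so the right-hand axis reads in original y2 units. The names a, b, from, to and y2lin are made up for this example.

```r
library(ggplot2)

set.seed(42)
dat <- data.frame(x = 1:10, y1 = sample(10), y2 = sample(100, size = 10))

# Linear map that sends range(y2) onto range(y1): y2lin = a + b * y2
from <- range(dat$y2)
to   <- range(dat$y1)
b <- diff(to) / diff(from)
a <- to[1] - b * from[1]

dat$y2lin <- a + b * dat$y2

p <- ggplot(dat, aes(x, y1)) +
  geom_line() +
  geom_text(aes(y = y2lin, label = y2)) +
  # The secondary axis applies the inverse map, recovering y2 units.
  scale_y_continuous(name = "y1",
                     sec.axis = sec_axis(~ (. - a) / b, name = "y2"))
print(p)
```

Because the forward and inverse maps share a and b, changing the target range only requires changing to; the secondary axis follows automatically.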
Sometimes getting these transformations correct can be confusing, and it is easy to mess them up. However ... realize that it took many years to even get this functionality into ggplot2, mostly due to the lead developer's belief that secondary axes, even when plotted well, can confuse the viewer and lead to misleading takeaways. I find that they can be useful sometimes, and there are techniques one can use to encourage correct interpretation, but ... it's hard to get right because it's easy to get wrong.
As an example of one technique that helps distinguish which axis goes with which data, see this:
ggplot(dat, aes(x, y1)) +
geom_line(color = "blue") +
geom_text(aes(y = y2scaled2, label = y2), color = "red") +
scale_y_continuous(name = "Oops!", sec.axis = sec_axis(~ . * 20, name = "Quux!")) +
theme(
axis.ticks.y.left = element_line(color = "blue"),
axis.text.y.left = element_text(color = "blue"),
axis.title.y.left = element_text(color = "blue"),
axis.ticks.y.right = element_line(color = "red"),
axis.text.y.right = element_text(color = "red"),
axis.title.y.right = element_text(color = "red")
)
(One might consider colors from viridis for a more color-blind-friendly palette.)
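Applied to the bar-plus-line data from the question, the same idea might look like the sketch below. It assumes one row per day (the dates are a stand-in) and uses a single scale factor k, a name introduced here, to draw cpa on the conversions axis and to label the secondary axis back in cpa units.

```r
library(ggplot2)

data <- data.frame(
  day = as.Date("2020-01-01") + 0:5,   # assumption: six consecutive days
  conversions = seq(1, 6)^2,
  cpa = 100000 / seq(1, 6)^2
)

# Scale factor that brings the largest cpa down to the largest conversions value.
k <- max(data$cpa) / max(data$conversions)

p <- ggplot(data, aes(x = day)) +
  geom_col(aes(y = conversions), fill = "grey70") +
  geom_line(aes(y = cpa / k), color = "red") +
  # Primary axis is in conversions; the secondary axis multiplies back by k.
  scale_y_continuous(name = "conversions",
                     sec.axis = sec_axis(~ . * k, name = "cpa"))
print(p)
```

Both layers are drawn on the primary axis; the secondary axis is only a relabelling, which is why the data has to be divided by k by hand.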
I have a data frame 'data' with three columns. The first column identifies the compound, the second the concentration of the compound and the third my measured data called 'Area'.
# A tibble: 12 x 3
Compound Conc Area
<chr> <dbl> <dbl>
1 Compound 1 0 247
2 Compound 1 5 44098
3 Compound 1 100 981797
4 Compound 1 1000 7084602
5 Compound 2 0 350
6 Compound 2 5 310434
7 Compound 2 100 6621537
8 Compound 2 1000 49493832
9 Compound 3 0 26
10 Compound 3 5 7707
11 Compound 3 100 174026
12 Compound 3 1000 1600143
I want to create a facetted plot per compound using geom_point & apply geom_smooth on the complete x axis. To look into detail in the lower concentration range I applied coord_cartesian to limit the x axis from 0 to 110.
However, each facet takes the maximum value of the given compound. As the scales are very different between compounds I can't use a fixed ylim as it would have to be different for each compound (in my real data I have > 20 compounds).
Is there a possibility to set the y-axis from 0 as minimum and as maximum per facet the maximal value which is visible?
The code I have (without any attempt at limiting the y-axis) is:
ggplot(data = data, aes(Conc, Area)) +
geom_point(size = 2.5) +
geom_smooth(method = "lm") +
facet_wrap(~Compound, ncol = 3, scales = "free_y") +
theme_bw() +
theme(legend.position = "bottom") +
coord_cartesian(xlim = c(0,110))
I figured out a workaround to get the results I want.
After creating a subset of the data I created a loop to plot all the data.
The subsetted data was used to determine the ylim in coord_cartesian.
With the resulting plot list I can use the gridExtra package to sort them in a grid.
library(dplyr)   # for %>% and filter()
library(ggplot2)
# Subset used only to find the largest Area visible in each compound's plot
data_100 <- data %>%
  filter(Conc <= 110)
loop.vector <- unique(data$Compound)
plot_list <- list()
for (i in seq_along(loop.vector)) {
  # One plot per compound; the fit still uses the full concentration range
  p <- ggplot(subset(data, Compound == loop.vector[i]),
              aes(Conc, Area)) +
    geom_point(size = 2.5) +
    geom_smooth(method = "lm", se = FALSE) +
    theme_bw() +
    theme(legend.position = "bottom") +
    coord_cartesian(xlim = c(0, 110),
                    ylim = c(0, max(data_100$Area[data_100$Compound == loop.vector[i]]))) +
    labs(title = loop.vector[i])
  plot_list[[i]] <- p
  print(p)
}
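The gridExtra step is not shown above. A possible version, sketched with the example data from the question and assuming the gridExtra package is installed, builds the list with lapply() and passes it to grid.arrange() through its grobs argument. The helper name xmax is made up for the example.

```r
library(ggplot2)
library(gridExtra)

data <- data.frame(
  Compound = rep(paste("Compound", 1:3), each = 4),
  Conc = rep(c(0, 5, 100, 1000), times = 3),
  Area = c(247, 44098, 981797, 7084602,
           350, 310434, 6621537, 49493832,
           26, 7707, 174026, 1600143)
)

xmax <- 110  # upper x limit of the zoomed view

plot_list <- lapply(unique(data$Compound), function(cmp) {
  d <- data[data$Compound == cmp, ]
  # Largest Area among the points that remain visible after zooming.
  ymax <- max(d$Area[d$Conc <= xmax])
  ggplot(d, aes(Conc, Area)) +
    geom_point(size = 2.5) +
    geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
    theme_bw() +
    coord_cartesian(xlim = c(0, xmax), ylim = c(0, ymax)) +
    labs(title = cmp)
})

# Lay the individual plots out in a three-column grid.
grid.arrange(grobs = plot_list, ncol = 3)
```

Since coord_cartesian() only zooms, the regression line is still fitted to all four concentrations, which matches the intent stated in the question.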
My data is a single column like this:
Number Assigned Row
1 1
2 1
3 2
4 1
5 2
6 3
... ...
When I plot using barplot I get what I want:
However when I use ggplot + geom_bar I get this:
This is my code for ggplot:
count <- data.frame(alldata[[xaxis]])
ggplot(data=count, aes(x="My X Axis", y="My Y Axis")) +
geom_bar(stat="identity")
versus the code I use for barplot:
counts <- table(alldata[[xaxis]])
barplot(counts,
main = xaxis,
xlab = "Percentile",
cex.names = 0.8,
col=c("darkblue","red"), beside = group != "NA")
Say this is your data:
df <- data.frame(AssRow = sample(1:3, 100, T, c(0.2, 0.5, 0.3)))
head(df)
# AssRow
#1 2
#2 1
#3 2
#4 3
#5 2
#6 2
This will get you a bar chart of the count of each Assigned Row, and colour them:
ggplot(df, aes(x=AssRow, fill=as.factor(AssRow))) +
geom_bar()
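If you would rather mirror the barplot(table(...)) call, here is a sketch of the pre-counted route: tabulate first, then map the resulting columns (unquoted) and draw them with geom_col(). Note that quoted strings inside aes(), as in the question, are treated as constants rather than column names. The column name AssRow follows the example data above; counts is a name chosen for the example.

```r
library(ggplot2)

set.seed(1)
df <- data.frame(AssRow = sample(1:3, 100, TRUE, c(0.2, 0.5, 0.3)))

# Same tabulation that barplot() was given, turned into a data frame
# with one row per category and a Freq column holding the counts.
counts <- as.data.frame(table(AssRow = df$AssRow))

p <- ggplot(counts, aes(x = AssRow, y = Freq, fill = AssRow)) +
  geom_col()
print(p)
```

geom_col() plots the heights it is given, whereas geom_bar() with its default stat counts the rows itself, so the two routes should give the same bars.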
To change the labels, use xlab and ylab; to make the background prettier, add theme_bw():
ggplot(df, aes(x=AssRow, fill=as.factor(AssRow))) +
geom_bar() +
xlab("My X-label") +
ylab("My Y label") +
theme_bw()
Output: