Setting facet-specific breaks in stat_contour - r

I'd like to show a contour plot using ggplot and stat_contour for two categories of my data with facet_grid. I want to highlight a particular level based on the data. Here's an analogous dummy example using the usual volcano data.
library(dplyr)
library(ggplot2)
v.plot <- volcano %>% reshape2::melt(.) %>%
mutate(dummy = Var1 > median(Var1)) %>%
ggplot(aes(Var1, Var2, z = value)) +
stat_contour(breaks = seq(90, 200, 12)) +
facet_grid(~dummy)
Plot 1:
Let's say within each factor level (here east and west halves, I guess), I want to find the mean height of the volcano and show that. I can calculate it manually:
volcano %>% reshape2::melt(.) %>%
mutate(dummy = Var1 > median(Var1)) %>%
group_by(dummy) %>%
summarise(h.bar = mean(value))
# A tibble: 2 × 2
dummy h.bar
<lgl> <dbl>
1 FALSE 140.7582
2 TRUE 119.3717
Which tells me that the mean heights on each half are 141 and 119. I can draw BOTH of those on BOTH facets, but not just the appropriate one on each side.
v.plot + stat_contour(breaks = c(141, 119), colour = "red", size = 2)
Plot 2:
And you can't put breaks= inside an aes() statement, so passing it in as a column in the original dataframe is out. I realize with this dummy example I could probably just do something like bins=2 but in my actual data I don't want the mean of the data, I want something else altogether.
Thanks!

I made another attempt at this problem and came up with a partial solution, but I'm forced to use a different geom.
volcano %>% reshape2::melt(.) %>%
mutate(dummy = Var1 > median(Var1)) %>%
group_by(dummy) %>%
mutate(h.bar = mean(value), # edit1
is.close = round(h.bar) == value) %>% #
ggplot(aes(Var1, Var2, z = value)) +
stat_contour(breaks = seq(90, 200, 12)) +
geom_point(colour = "red", size = 3, # edit 2
aes(alpha = is.close)) + #
scale_alpha_discrete(range = c(0,1)) + #
facet_grid(~dummy)
In edit 1 I added a mutate() to the above block to generate a variable identifying where value was "close enough" (rounded to the nearest integer) to the desired highlight point (the mean of the data for this example).
In edit 2 I added geom_point() to show the grid locations with the desired value, and hid the undesired ones by mapping them to an alpha of 0 (fully transparent).
Plot 3:
The problem with this solution is that it's very gappy, and trying to bridge those with geom_path is a jumbled mess. I tried coarser rounding as well, and it just made things muddy.
Would love to hear other ideas! Thanks
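One idea that might be worth trying, since breaks= cannot go inside aes(): add one stat_contour layer per facet, each restricted to its own rows via data= and given its own break. The sketch below is only that, a sketch. It rebuilds the melted volcano data with base R in place of reshape2::melt, and the names v, h_bar and highlight_layers are made up for illustration. It relies on ggplot2 accepting a list of layers on the right-hand side of +.

```r
library(ggplot2)

# Long-format volcano data (same columns that reshape2::melt would produce).
v <- data.frame(
  Var1  = as.vector(row(volcano)),
  Var2  = as.vector(col(volcano)),
  value = as.vector(volcano)
)
v$dummy <- v$Var1 > median(v$Var1)

# One highlight level per facet; the per-facet mean stands in for
# whatever statistic is actually wanted.
h_bar <- sapply(split(v$value, v$dummy), mean)

# One contour layer per facet level, each seeing only its own rows.
highlight_layers <- lapply(names(h_bar), function(lev) {
  stat_contour(
    data   = v[as.character(v$dummy) == lev, ],
    breaks = h_bar[[lev]],
    colour = "red"
  )
})

p <- ggplot(v, aes(Var1, Var2, z = value)) +
  stat_contour(breaks = seq(90, 200, 12)) +
  highlight_layers +
  facet_grid(~dummy)
```

Because each subset still carries the dummy column, every highlight layer should land only in its own facet.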

Related

How to calculate and label peak value of distribution by multiple conditions/facets in R ggplot?

While the question appears similar to others, there's a key difference in my mind.
I want to be able to calculate and/or print the peak value of a density curve for EACH SUB-CONDITION BY FACET (graphing it would be the ultimate goal, but calculating it in the data frame is the primary goal). The density graph looks like this:
So, ideally, I would be able to know the intensity (x-axis value) corresponding to the highest peak of the density curves for each condition.
Here's some dummy data:
set.seed(1234)
library(tidyverse)
library(fs)
n <- 100000
silence <- factor(c("sil1", "sil2", "sil3", "sil4", "sil5"))
treat <- factor(c("con", "uos", "uos+wnt5a", "wnt5a"))
# Recycle the levels to n rows; sample with replacement because n exceeds
# the number of distinct intensity values
df <- tibble(
silence = rep(silence, length.out = n),
treat = rep(treat, length.out = n),
intensity = sample(4000:10000, n, replace = TRUE)
)
What I've tried:
Subsetting the primary DF and going through and calculating the density of each condition, but this could take days
Something close to this answer: Calculating peaks in histograms or density functions but not quite. I think the data look better as a histogram personally, but that constructs an arbitrary number of bins for intensity data (a continuous measure). The histogram looks like this:
Again, it would be sufficient to get the peak values for each of these groups (i.e., treatments by silencing subdistributions) just in the console, but adding them as a vertical line in the graphs would be a sweet cherry on top (it could also make it hella busy, so I will see about that piece later)
Thank you!!
Depending on the way you're producing the density plots, there may be a more direct way to recreate the density calculation before it goes into ggplot. That'll be the easiest way to get the peak values and keep them in the format of your data.
Without that, here's a hack that should work in general, but requires some kludging to fit the extracted points back into the form of your original data.
Here's a plot like yours:
mtcars %>%
mutate(gear = as.character(gear)) %>%
ggplot(aes(wt, fill = gear, group = gear)) +
geom_density(alpha = 0.2) +
facet_wrap(~am) ->my_plot
Here are the components that make up that plot:
ggplot_build(my_plot) -> my_plot_innards
With some ugly hacking we can extract the points that make up the curves and make them look kind of like our original data. Some info is destroyed, e.g. the gear values 3/4/5 become group 1/2/3. There might be a cool way to convert back, but I don't know it yet.
extracted_points <- tibble(
wt = my_plot_innards[["data"]][[1]][["x"]],
y = my_plot_innards[["data"]][[1]][["y"]],
gear = (my_plot_innards[["data"]][[1]][["group"]] + 2) %>% as.character, # HACK
am = (my_plot_innards[["data"]][[1]][["PANEL"]] %>% as.numeric) - 1 # HACK
)
ggplot(extracted_points, aes(wt, y, fill = gear)) +
geom_point(size = 0.3) +
facet_wrap(~am)
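On converting back: the two HACK lines can possibly be avoided. The sketch below assumes that ggplot2 numbers groups by the sorted unique values of the grouping column (which is what the "+ 2" trick implicitly relies on), and it reads the facet value for each PANEL from the layout table stored in the built plot. The names built, curve_pts, gear_levels and panel_key are made up for illustration.

```r
library(ggplot2)

p <- ggplot(transform(mtcars, gear = as.character(gear)),
            aes(wt, fill = gear, group = gear)) +
  geom_density(alpha = 0.2) +
  facet_wrap(~am)

built <- ggplot_build(p)
curve_pts <- built$data[[1]]

# Assumption: group ids 1, 2, 3 follow the sorted unique gear values.
gear_levels <- sort(unique(as.character(mtcars$gear)))
curve_pts$gear <- gear_levels[curve_pts$group]

# The layout table records which value of am each PANEL shows.
panel_key <- built$layout$layout[, c("PANEL", "am")]
curve_pts <- merge(curve_pts, panel_key, by = "PANEL")
```

After this, curve_pts carries gear and am columns with the same values as the original data, so no manual offsets are needed.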
extracted_points_notes <- extracted_points %>%
group_by(gear, am) %>%
slice_max(y)
my_plot +
geom_point(data = extracted_points_notes,
aes(y = y), color = "red", size = 3, show.legend = FALSE) +
geom_text(data = extracted_points_notes, hjust = -0.5,
aes(y = y, label = scales::comma(y)), color = "red", size = 3, show.legend = FALSE)
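For the "more direct way" mentioned at the top, here is a rough sketch that computes the density per group before anything reaches ggplot. It assumes the plot uses the default settings of geom_density, which to my knowledge match the defaults of stats::density (Gaussian kernel, "nrd0" bandwidth); with other settings the peaks may differ slightly. peak_x is a hypothetical helper name.

```r
library(dplyr)

# x position of the highest point of a kernel density estimate
peak_x <- function(x) {
  d <- stats::density(x)
  d$x[which.max(d$y)]
}

peaks <- mtcars %>%
  group_by(am, gear) %>%
  summarise(peak_wt = peak_x(wt), .groups = "drop")
```

The result has one row per am/gear combination, so it can be joined back to the data or passed to geom_vline(data = peaks, aes(xintercept = peak_wt)).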

ggRadar highlight top values in radar

Hi everyone, I am making a radar plot and I want to highlight the two highest values in the factors or levels. Highlighting in this case means making the text of the top three values bold.
require(ggplot2)
require(ggiraph)
require(plyr)
require(reshape2)
require(moonBook)
require(sjmisc)
ggRadar(iris,aes(x=c(Sepal.Length,Sepal.Width,Petal.Length,Petal.Width)))
an example can be like this
thank you
Here is a step-by-step example of how to highlight specific categories in a radar plot. I don't really see the point of all these extra dependencies (ggRadar etc.), as it's pretty straightforward to draw a radar plot in ggplot2 directly using polar coordinates.
First, let's generate some sample data. According to OP's comments and the example based on the iris dataset, we select the maximal value for every variable (from Sepal.Length, Sepal.Width, Petal.Length, Petal.Width); we then store the result in a long tibble for plotting.
library(purrr)
library(dplyr)
library(tidyr)
df <- iris %>% select(-Species) %>% map_df(max) %>% pivot_longer(everything())
df
# # A tibble: 4 x 2
# name value
# <chr> <dbl>
#1 Sepal.Length 7.9
#2 Sepal.Width 4.4
#3 Petal.Length 6.9
#4 Petal.Width 2.5
Next, we make use of a custom coord_radar function (thanks to this post), that is centred around coord_polar and ensures that polygon lines in a polar plot are straight lines rather than curved arcs.
coord_radar <- function (theta = "x", start = - pi / 2, direction = 1) {
theta <- match.arg(theta, c("x", "y"))
r <- if (theta == "x") "y" else "x"
ggproto(
"CordRadar", CoordPolar, theta = theta, r = r, start = start,
direction = sign(direction),
is_linear = function(coord) TRUE)
}
We now create a new column df$face that is "bold" for the top 3 variables (ranked by decreasing value) and "plain" otherwise. We also need to make sure that factor levels of our categories are sorted by row number (otherwise name and face won't necessarily match later).
df <- df %>%
mutate(
rnk = rank(-value),
face = if_else(rnk < 4, "bold", "plain"),
name = factor(name, levels = unique(name)))
We can now draw the plot
library(ggplot2)
ggplot(df, aes(name, value, group = 1)) +
geom_polygon(fill = "red", colour = "red", alpha = 0.4) +
geom_point(colour = "red") +
coord_radar() +
ylim(0, 10) +
theme(axis.text.x = element_text(face = df$face))
Note that this gives a warning, which I choose to ignore here, as we explicitly make use of the vectorised element_text option.
Warning message:
Vectorized input to element_text() is not officially supported.
Results may be unexpected or may change in future versions of ggplot2.
My suggestion would be to identify the highest values you wish to highlight and put them in a data frame. Then use geom_richtext() (from the ggtext package) to highlight them.

R: suitable plot to display data with skewed counts

I have data like:
Name Count
Object1 110
Object2 111
Object3 95
Object4 40
...
Object2000 1
So only the first 3 objects have high counts; the remaining 1996 objects have counts below 40, with the majority below 10. I am plotting this data with a ggplot bar chart like this:
ggplot(data=object_count, mapping = aes(x=object, y=count)) +
geom_bar(stat="identity") +
theme(axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank())
My plot is as below. As you can see, because there are so many objects with low counts, the graph is very wide and each bar is tiny, so even the high-count objects are almost invisible. Is there a better way to represent this data? My goal is to show a few top-count objects and to show that there are many low-count ones. Is there a way to group the low-count ones together?
My guess is that your data looks something like this:
library(tidyverse)
set.seed(1)
object_count <- tibble(
obj_num = 1:2000,
object = paste0("Object", obj_num),
count = ceiling(20 * rpois(2000, 10) / obj_num)
)
head(object_count)
## A tibble: 6 x 3
# obj_num object count
# <int> <chr> <dbl>
#1 1 Object1 160
#2 2 Object2 100
#3 3 Object3 46
#4 4 Object4 55
#5 5 Object5 56
#6 6 Object6 40
Sure enough, when I plot that with ggplot(object_count, aes(object, count)) + geom_col() + [theme stuff] I get a similar figure.
Here are some strategies "to show a few top-count objects and to show there are many low-count ones."
Histogram
A vanilla histogram might not be clarifying here, since the important big values appear dramatically less often and would not be prominent enough:
ggplot(object_count, aes(count)) +
geom_histogram()
But we could change that by transforming the y axis to bring more emphasis to small values. The pseudo_log transformation is nice for that since it works like a log transform for large values, but linearly near -1 to 1. In this view, we can clearly see where the outliers with just one appearance are, but also see that there are many more small values. The binwidth = 1 here could be set to something wider if the specific values of the big values aren't as important as their general range.
ggplot(object_count, aes(count)) +
geom_histogram(binwidth = 1) +
scale_y_continuous(trans = "pseudo_log",
breaks = c(0:3, 100, 1000), minor_breaks = NULL)
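To see what the pseudo_log transformation does to the numbers, here is a small check using scales::pseudo_log_trans() with its default arguments (sigma = 1, natural log base). This is only an illustration of the transform itself, not part of the plot.

```r
library(scales)

tr <- pseudo_log_trans()

# Near zero the transform is roughly linear (about x / 2) ...
small <- tr$transform(c(0, 0.01, 0.1))

# ... while for large inputs it tracks the natural log closely.
large <- tr$transform(c(100, 1000))
log_large <- log(c(100, 1000))
```

Zero maps to zero, which is why bins with a count of 0 or 1 stay visible where a plain log scale would fail.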
Faceting
Another option could be to split your view into two pieces, one with detail on the big values, the other showing all the small values:
object_count %>%
mutate(biggies = if_else(count > 20, "Big", "Little")) %>%
ggplot(aes(obj_num, count)) +
geom_col() +
facet_grid(~biggies, scales = "free")
Lumping
Another option might be to lump together all the objects with counts under 10. The version below emphasizes the object name and count, and the "Others" category has been labeled to show how many objects it includes.
object_count %>%
mutate(group = if_else(count < 10, "Others", object)) %>%
group_by(group) %>%
summarize(avg = mean(count), count = n()) %>%
ungroup() %>%
mutate(group = if_else(group == "Others",
paste0("Others (n = ", count, ")"),
group)) %>%
mutate(group = forcats::fct_reorder(group, avg)) %>%
ggplot() +
geom_col(aes(group, avg)) +
geom_text(aes(group, avg, label = round(avg, 0)), hjust = -0.5) +
coord_flip()
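A variation on the same idea, in case a fixed cutoff is awkward: keep the top few objects by rank and lump everything else. This is a minimal sketch on toy numbers (the first four counts come from the question, the rest are invented), and top_n_keep is a made-up parameter name.

```r
library(dplyr)

# Toy data: first four counts from the question, the rest invented.
object_count <- tibble(
  object = paste0("Object", 1:8),
  count  = c(110, 111, 95, 40, 7, 3, 1, 1)
)

top_n_keep <- 3  # how many objects keep their own bar

lumped <- object_count %>%
  mutate(group = if_else(min_rank(desc(count)) <= top_n_keep,
                         object, "Others")) %>%
  group_by(group) %>%
  summarise(total = sum(count), n_objects = n(), .groups = "drop")
```

Here the "Others" row sums the lumped counts instead of averaging them; which of the two is more useful depends on what the chart is meant to show.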
Cumulative count (~Pareto chart)
If you're interested in the share of total count, you might also look at the cumulative count and see how the big values make up a large share:
object_count %>%
mutate(cuml = cumsum(count)) %>%
ggplot(aes(obj_num)) +
# centre each tile halfway between the previous and the current cumulative total
geom_tile(aes(y = lag(cuml, default = 0) + count / 2,
height = count))
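If only the numbers are needed, the cumulative share can also be computed directly. A small sketch on toy counts (the first four values are from the question, the fifty 1s are an assumption standing in for the long tail):

```r
# Toy counts: four large values from the question plus an assumed tail of 1s.
counts <- c(110, 111, 95, 40, rep(1, 50))

sorted    <- sort(counts, decreasing = TRUE)
cum_share <- cumsum(sorted) / sum(sorted)

# Number of objects needed to reach half of the total count
n_for_half <- which(cum_share >= 0.5)[1]
```

With these toy numbers the two largest objects already account for more than half of the total.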

Assigning specific colors to specific cases in ridgeline plots in R

Recently this community helped me tremendously with getting Ridgeline plots to work with my data.
Now I am struggling with coloring them according to my needs.
Basically what I want is plotting my cases in different orders but they should keep a specific color so observations remain recognizable even when plotted in a different order. So far I failed with applying the available solutions to my requirements.
Let us take for example this data, where we have a name, a mean and an SD:
caseName caseMean caseSD
Svansdottir 2006 -0.0646 0.4032398
Guétin 2009 -0.4649 0.3995663
Raglio 2010a -0.2145 0.2814031
Let's first sort them by caseMean:
df$caseName <- factor(df$caseName, levels = df$caseName[order(df$caseMean)])
and plot it with the following code:
library(tidyverse); library(ggridges)
n = 100
df3 <- df %>%
mutate(low = caseMean - 3 * caseSD, high = caseMean + 3 * caseSD) %>%
uncount(n, .id = "row") %>%
mutate(x = (1 - row/n) * low + row/n * high,
norm = dnorm(x, caseMean, caseSD))
ggplot(df3, aes(x, caseName, height = norm, fill=caseName)) +
geom_ridgeline(scale = 2,alpha=0.75) +
scale_fill_viridis_d()
we get this:
Now we reverse the order
df$caseName <- factor(df$caseName, levels = df$caseName[order(-df$caseMean)])
and plot again with the code above we see that the plots have switched color:
How can I make sure that the same cases have always the same colors no matter the order I put them in?
I would like to have code that doesn't require me to "hard-wire" colors to a specific case name. I want to be able to do this for ridgeline plots with 20, 30, or more observations. The fact that I picked the viridis color palette doesn't matter. I am happy with any solution (such as heat.colors or something similar).
If your new factor is just reversing the order of the previous one, you could use the argument direction in scale_fill_viridis_d().
For more complicated cases (i.e. re-leveling a factor), one possibility is to add the colour manually, possibly in your original data frame, and pass it to scale_fill_manual().
Simple case: reversing order of factor
library(tidyverse)
df <- data.frame(name = letters[3:1], value = c(3,1,2))
pl_1 <- ggplot(aes(x=name, y=value, fill=name), data=df)+
geom_col() +
scale_fill_viridis_d()
pl_1
pl_1 %+% mutate(df, name = factor(name, levels = c("c", "b", "a"))) +
scale_fill_viridis_d(direction=-1)
#> Scale for 'fill' is already present. Adding another scale for 'fill',
#> which will replace the existing scale.
More complicated case
library(tidyverse)
library(viridis)
df_new <- tibble(name = letters[3:1], value = c(3,1,2),
col = rev(viridis(3))) %>%
mutate(name = factor(name, levels = c("c", "b", "a"))) %>%
arrange(name)
df_new %>%
ggplot(aes(x=name, y=value, fill=name)) +
geom_col() +
scale_fill_manual(values=df_new$col)
Created on 2019-06-06 by the reprex package (v0.3.0)
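A possible alternative that does not depend on row order: build the palette once as a named vector keyed by case name and pass it to scale_fill_manual(), which matches named values to factor levels. This is a sketch under assumptions: pal and plot_in_order are made-up names, and hcl.colors() from base R (3.6 or later) stands in for the viridis package.

```r
library(ggplot2)

df <- data.frame(name = c("c", "b", "a"), value = c(3, 1, 2))

# Assign each case a colour once, independent of any later ordering.
case_names <- sort(unique(df$name))
pal <- setNames(hcl.colors(length(case_names), "viridis"), case_names)

# Hypothetical helper: same plot, arbitrary level order.
plot_in_order <- function(lev) {
  ggplot(transform(df, name = factor(name, levels = lev)),
         aes(name, value, fill = name)) +
    geom_col() +
    scale_fill_manual(values = pal)
}

p_forward  <- plot_in_order(c("a", "b", "c"))
p_reversed <- plot_in_order(c("c", "b", "a"))
```

Since nothing here is hard-wired to specific names, the same lines should work for 20 or 30 cases.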

r percentage by bin in histogram ggplot

I have a data set like this ->
library(ggplot2)
response <- c("Yes","No")
gend <- c("Female","Male")
purchase <- sample(response, 20, replace = TRUE)
gender <- sample(gend, 20, replace = TRUE)
df <- as.data.frame(purchase)
df <- cbind(df,gender)
so head(df) looks like this ->
purchase gender
1 Yes Female
2 No Male
3 No Female
4 No Female
5 Yes Female
6 No Female
Also, so you can validate my examples, here is table(df) for my particular sampling.
(please don't worry about matching my percentages)
gender
purchase Female Male
No 6 3
Yes 4 7
I want a "histogram" showing Gender, but split by Purchase.
I have gone this far ->
ggplot(df) +
geom_bar(aes(y = (..count..)/sum(..count..)),position = "dodge") +
aes(gender, fill = purchase)
which generates ->
histogram with split bins, by percentage, but not the aggregate level I want
The Y axis has Percentage as I want, but it has each bar of the chart as a percentage of the whole chart.
What I want is the two "Female" bars to each be a percentage of their respective "Purchase". So in the chart above I would like the four bars to be 66%, 36%, 33%, 64%, in that order.
I have tried with geom_histogram to no avail. I have checked SO, searched, ggplot documentation, and several books.
Regarding the suggestion to look at the previous question about facets; that does work, but I had hoped to keep the chart visually as it is above, as opposed to split into "two charts". So...
Anyone know how to do this?
Thanks.
Try something like this:
library(tidyverse)
df %>%
count(purchase, gender) %>%
ungroup %>%
group_by(gender) %>%
mutate(prop = prop.table(n)) %>%
ggplot(aes(gender, prop, group = purchase)) +
geom_bar(aes(fill = purchase), stat = "identity", position = "dodge")
The first 5 lines create a column prop (for "proportion"), which aggregates across gender.
To get there, you first count each purchase by gender (similar to the output of table(df)). Ungrouping then regrouping only by gender gives the aggregation we want.
Regarding the percentages you want, is the denominator based on gender, or purchase? In the example given above, 66% for female & no purchase would be a result of 6 divided by the sum of no purchases (6+3) rather than the sum of all females (6+4).
It's definitely possible to plot that, but I'm not sure if the result would be intuitive to interpret. I got confused myself for a while.
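To make the two possible denominators concrete, here is the arithmetic on the table from the question, using prop.table() with each margin. The object names tab, by_purchase and by_gender are just for illustration.

```r
# Counts from the table in the question
tab <- matrix(
  c(6, 4, 3, 7), nrow = 2,
  dimnames = list(purchase = c("No", "Yes"), gender = c("Female", "Male"))
)

# Denominator = purchase totals (each row sums to 1)
by_purchase <- prop.table(tab, margin = 1)

# Denominator = gender totals (each column sums to 1)
by_gender <- prop.table(tab, margin = 2)
```

by_purchase gives 6/9 and 3/9 for "No" and 4/11 and 7/11 for "Yes", which correspond to the percentages requested in the question; by_gender gives 60%/40% for Female and 30%/70% for Male.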
The following hack makes use of the weight aesthetic. I've used purchase as the grouping variable here based on the expected output described in the question, though I think gender makes more sense (as per TTNK's answer above):
df <- data.frame(purchase = c(rep("No", 6), rep("Yes", 4), rep("No", 3), rep("Yes", 7)),
gender = c(rep("Female", 10), rep("Male", 10)))
ggplot(df %>%
group_by(purchase) %>% #change this to gender if that's the intended denominator
mutate(w = 1/n()) %>% ungroup()) +
aes(gender, fill = purchase, weight = w)+
geom_bar(aes(x = gender, fill = purchase), position = "dodge")+
scale_y_continuous(name = "percent", labels = scales::percent)
