Stacked bar plot in violin plot shape - r

Maybe this is a stupid idea, or maybe it's a brain wave. I have a dataset of lipid classes in 4 different species. The data is proportional, and the sums are 1000. I want to visualise the differences in proportions for each class in each species. Generally a stacked bar would be the way to go here, but there are several classes, and it becomes uninterpretable since only the bottom class shares a baseline (see below).
And this appears to be the best option of a bad bunch, with pie and donut charts being nothing short of sneered at.
I was then inspired by this creation Symmetrical, violin plot-like histogram?, which creates a sort of stacked distribution violin plot (see below).
I am wondering if this could somehow be converted into a stacked violin, such that each segment represents a whole variable. In the case of my data, species A and D would be 'fat' around the TAG segment, and 'skinnier' at the STEROL segment. This way the proportions are depicted horizontally, and always have a common baseline. Thoughts?
Data:
DF <- structure(list(Sample = c("A", "A", "A", "B", "B", "B", "C",
"C", "C", "D", "D"), WAX = c(83.7179798600773, 317.364310355766,
20.0147496567679, 93.0194886619568, 78.7886829173726, 79.3445694220837,
91.0020522660375, 88.1542855137005, 78.3313314713951, 78.4449591023115,
236.150030864875), TAG = c(67.4640254081232, 313.243238213156,
451.287867136276, 76.308508343969, 40.127554151831, 91.1910102221636,
61.658394708941, 104.617259648364, 60.7502685224869, 80.8373642262043,
485.88633863193), FFA = c(41.0963382465756, 149.264019576272,
129.672579626868, 51.049208042632, 13.7282635713804, 30.0088572108344,
47.8878116348504, 47.9564218319094, 30.3836532949481, 34.8474205480686,
10.9218910757234), `DAG1,2` = c(140.35876401479, 42.4556176551009,
0, 0, 144.993393432366, 136.722412691012, 0, 140.027443968931,
137.579074961889, 129.935353616471, 46.6128854387559), STEROL = c(73.0144390122309,
24.1680929257195, 41.8258704279641, 78.906816661241, 67.5678558060943,
66.7150537517493, 82.4794113296791, 76.7443442992891, 68.9357008866253,
64.5444668132533, 29.8342694785768), AMPL = c(251.446564854412,
57.8713327050339, 306.155806819949, 238.853696442419, 201.783872969561,
175.935515655693, 234.169038776536, 211.986239116884, 196.931330316831,
222.658181144794, 73.8944654414811), PE = c(167.99718650752,
43.3839497916674, 22.1937177530762, 150.315149187176, 153.632530721031,
141.580725482114, 164.215442147509, 155.113323256627, 143.349000132624,
128.504657216928, 50.6281347160092), PC = c(174.904702096271,
52.2494387772846, 28.8494085790995, 191.038328534942, 190.183655117756,
175.33290326259, 199.2632149392, 175.400682364295, 176.64926273487,
163.075864395099, 66.071984352649), LPC = c(0, 0, 0, 120.508804125665,
109.194191312608, 103.16895230176, 119.324634197247, 0, 107.09037767833,
97.151732936871, 0)), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -11L), .Names = c("Sample", "WAX", "TAG",
"FFA", "DAG1,2", "STEROL", "AMPL", "PE", "PC", "LPC"))

This is essentially a horizontal bar plot:
library(reshape2)
library(ggplot2)
# DF is the data frame from the question's dput() output.
# Reshape to long format: one row per sample and lipid class.
DFm <- melt(DF, id.vars = "Sample")
# Mirror the values around zero so each bar extends symmetrically.
DFm1 <- DFm
DFm1$value <- -DFm1$value
DFm <- rbind(DFm, DFm1)
ggplot(DFm, aes(x = "A", y = value / 10, fill = variable, color = variable)) +
  geom_bar(stat = "identity", position = "dodge") +
  coord_flip() +
  theme_minimal() +
  facet_wrap(~ Sample, nrow = 1, switch = "x") +
  theme(axis.text = element_blank(),
        axis.title = element_blank(),
        panel.grid = element_blank())
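As a rough sketch of the same idea with one bar per class: the block below averages the replicates for each species first, then mirrors half of each mean around zero, so the total bar length equals the mean itself. The `toy` data frame and its values are made up for illustration, and `half` is a hypothetical helper column; neither comes from the question.

```r
library(ggplot2)

# Toy stand-in for the question's data: two species, three lipid classes,
# two replicates each (values are invented for illustration).
toy <- data.frame(
  Sample   = rep(c("A", "B"), each = 6),
  variable = rep(c("WAX", "TAG", "STEROL"), times = 4),
  value    = c(80, 300, 70, 320, 450, 25,
               90, 75, 80, 80, 40, 65)
)

# One mean per species and class, so each class is drawn as a single bar.
means <- aggregate(value ~ Sample + variable, data = toy, FUN = mean)

# Mirror around zero: half of the mean on each side of the axis.
mirrored <- rbind(
  transform(means, half = value / 2),
  transform(means, half = -value / 2)
)

# Positive and negative halves stack away from zero, giving symmetric bars.
p <- ggplot(mirrored, aes(x = variable, y = half, fill = variable)) +
  geom_col(width = 1) +
  coord_flip() +
  facet_wrap(~ Sample, nrow = 1) +
  theme_minimal()
```

Printing `p` draws the plot. With `width = 1` the class bars touch, which is what gives the outline its violin-like look.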

Related

logarithmic y-axis issue in R/ ggplot2

I plotted a histogram from a frequency distribution table using ggplot2. Here is some sample data
dput(test_data)
structure(list(inst = c(5, 5, 5, 10, 10, 10, 15, 15, 15), equip = c("a",
"b", "c", "a", "b", "c", "a", "b", "c"), value = c(0.520670542493463,
0.7556017707102, 0.931902746669948, 0.206132101127878, 0.0114199279341847,
0.603053622646257, 0.315444506937638, 0.375196750741452, 0.983124621212482
)), class = "data.frame", row.names = c(NA, -9L))
When I use ggplot2 to plot the data, I get the following output:
test_hist1 <- ggplot(test_data, aes(x = inst, y = value, fill = equip)) +
  geom_bar(width = 3, alpha = 1, stat = "identity", position = "stack") +
  theme_bw() +
  xlab(expression(Value)) +
  ylab("value") +
  ggtitle(expression(test~data)) +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_fill_manual(values = c("#00FF00", "#FFD700", "#DC143C"))
But when I transform the y_axis to be a log_axis, the plot direction changes and so does the intensity of the bars.
test_hist2 <- ggplot(test_data, aes(x = inst, y = value, fill = equip)) +
  geom_bar(width = 3, alpha = 1, stat = "identity", position = "stack") +
  theme_bw() +
  xlab(expression(Value)) +
  ylab("log_yaxis") +
  ggtitle(expression(test~data)) +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_fill_manual(values = c("#00FF00", "#FFD700", "#DC143C")) +
  scale_y_log10()
My second plot is wrong: the code for the second plot just converts my y-axis values to log10(y_axis_value) instead of producing the log axis shown in the following answer (the plot in that answer has the axis I am looking for). Can someone point me in the right direction? Thanks for the help.
R: Difference between log axis scale vs. manual log transformation?
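A quick check of why the bars change direction, offered as a sketch rather than a full answer: every value in the sample data is below 1, so its log10 is negative, and ggplot2 applies scale transformations before statistics and stacking. The bars are therefore stacked as negative numbers in log space. The values below are rounded from the question's sample data; `logged` and `stacked` are names introduced here for illustration.

```r
# Sample data from the question, rounded to four decimals.
test_data <- data.frame(
  inst  = rep(c(5, 10, 15), each = 3),
  equip = rep(c("a", "b", "c"), times = 3),
  value = c(0.5207, 0.7556, 0.9319,
            0.2061, 0.0114, 0.6031,
            0.3154, 0.3752, 0.9831)
)

# What scale_y_log10() hands to the bars: log10 of each value.
logged <- log10(test_data$value)

# All values are below 1, so every transformed value is negative.
all_negative <- all(logged < 0)

# Stacking then adds up these negative numbers within each inst group.
stacked <- tapply(logged, test_data$inst, sum)
```

Because the stack totals are sums of logs, they correspond to products of the original values, not sums, which is why the bar heights no longer match the untransformed plot.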

Label or Highlight Specific Rows in ggplot2

I have a great looking geom_tile plot, but I need a way to highlight specific rows or label specific rows based on a binary value.
Here is a small subset of data in wide format and resulting output:
df <- structure(list(bin_level = c(0,1), sequence = c("L19088.1", "chr1_43580199_43586187"), X236 = c("G", "."), X237 = c("G", "."), X238 = c("A", "a"),
X239 = c("T", "C"), X240 = c("A", "c"), X241 = c("G", "G"
)), class = "data.frame", row.names = 1:2)
> df
  bin_level               sequence X236 X237 X238 X239 X240 X241
1         0               L19088.1    G    G    A    T    A    G
2         1 chr1_43580199_43586187    .    .    a    C    c    G
The actual dataset is much larger, with 1045 observations of 3096 variables.
My goal is to plot this massive dataset as a heatmap with colors for each different nucleotide and be able to differentiate between rows with bin_levels of 0 and 1.
The following code makes a great plot, but doesn't include the bin_level differences I need to see. I would like to highlight the entire row if the bin_level is 1, but I haven't been able to find anything on how to do such a thing. I am already using nucleotides for the aes fill variable, so I need something else. The best option I've come up with so far is to color the row labels. I used info from this post to try an ifelse statement to color based on the bin_level variable.
The biggest problems here are:
Row axis titles are much too long and too many to look good
There are only 53 bin_level rows with a 1 (of 1045 total), so why does it look like a LOT more red than there should be?
I want the red labels (bin_level =1's) at the top of the plot, and the mix of black/red makes me think my arrange(bin_level) piece isn't working right.
Please let me know if you know of a better way to accomplish what I'm trying to accomplish, or can help make my code work better than it is currently. Thank you!
df %>%
## reshape to long table
## (one column each for sequence, position and nucleotide):
pivot_longer(-c("Sequence", "bin_level"), ## stack all columns *except* sequence and bin_level
names_to = 'position',
values_to = 'nucleotide'
) %>%
arrange(bin_level) %>%
## create the plot:
ggplot() +
geom_tile(aes(x = position, y = Sequence, fill = nucleotide),
height = 1 ## adjust to visually separate sequences
) +
scale_fill_manual(values = c('a'='#ea0064', 'c'='#008a3f', 'g'='#116eff',
't'='#cf00dc', '\U00B7'='#000000', 'X' ='#ffffff'
)
) +
labs(x = 'x-axis-title', y='Sequence') +
## remove x-axis (=position) elements: they'll probably be too dense:
theme(axis.title.x = element_blank(),
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
axis.ticks.y = element_blank(),
axis.text.y = element_text(colour = ifelse(levels(df$bin_level)==1, "red", "black"))
)
While passing a vector of colors to element_text() is a quick option in some cases, IMHO in more general cases it is error-prone and requires you to keep an eye on how you ordered your data. Instead I would suggest having a look at the ggtext package, which introduces the theme element element_markdown and allows for styling text using some HTML, CSS and markdown.
Moreover, besides the issue already pointed out by @I_O, another issue is that you combine the data manipulation steps and the plotting code in one pipeline. As a consequence, while you arrange your data by bin_level, the color assignment uses the original, unarranged dataset df, which by the way is still in wide format. That's why personally I would always recommend splitting the data wrangling from the plotting except for very simple cases.
Finally, while you arranged your data by bin_level, what really matters is the order of sequence, i.e. you have to set the order of sequence after arranging, for which I use forcats::fct_inorder.
Note: To make your example more realistic I duplicated your dataset to add two more rows.
library(tidyr)
library(dplyr)
library(ggplot2)
df_long <- df %>%
pivot_longer(-c("sequence", "bin_level"),
names_to = "position",
values_to = "nucleotide"
) %>%
arrange(bin_level) %>%
mutate(
sequence = if_else(bin_level == 1, paste0("<span style='color: red'>", sequence, "</span>"), sequence),
sequence = forcats::fct_inorder(sequence))
ggplot(df_long) +
geom_tile(aes(x = position, y = sequence, fill = nucleotide),
height = 1
) +
scale_fill_manual(values = c(
"a" = "#ea0064", "c" = "#008a3f", "g" = "#116eff",
"t" = "#cf00dc", "\U00B7" = "#000000", "X" = "#ffffff"
)) +
labs(x = "x-axis-title", y = "Sequence") +
theme(
axis.title.x = element_blank(),
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
axis.ticks.y = element_blank(),
axis.text.y = ggtext::element_markdown()
)
DATA
df <- structure(list(
bin_level = c(0, 1), sequence = c("L19088.1", "chr1_43580199_43586187"), X236 = c("G", "."), X237 = c("G", "."), X238 = c("A", "a"),
X239 = c("T", "C"), X240 = c("A", "c"), X241 = c("G", "G")
), class = "data.frame", row.names = 1:2)
df1 <- structure(list(
bin_level = c(0, 1), sequence = c("L19088.2", "chr1_43580199_43586187.2"), X236 = c("G", "."), X237 = c("G", "."), X238 = c("A", "a"),
X239 = c("T", "C"), X240 = c("A", "c"), X241 = c("G", "G")
), class = "data.frame", row.names = 1:2)
df <- dplyr::bind_rows(df, df1)
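The ordering step in the answer above can be sketched in base R as follows. This is a minimal illustration, assuming a made-up four-row data frame `seqs` and a hypothetical `label` column; `factor(levels = unique(...))` is used here as a base R stand-in for forcats::fct_inorder.

```r
# Made-up rows for illustration: two sequences per bin_level.
seqs <- data.frame(
  sequence  = c("L1", "chr1", "L2", "chr2"),
  bin_level = c(0, 1, 0, 1)
)

# Sort by bin_level; ties keep their original order.
ordered <- seqs[order(seqs$bin_level), ]

# Wrap the bin_level == 1 names in the HTML span that element_markdown() renders.
ordered$label <- ifelse(
  ordered$bin_level == 1,
  paste0("<span style='color: red'>", ordered$sequence, "</span>"),
  ordered$sequence
)

# Fix the factor levels in order of appearance, so the axis follows the sort.
ordered$label <- factor(ordered$label, levels = unique(ordered$label))
label_levels <- levels(ordered$label)
```

A discrete y axis places the first level at the bottom, so with this ordering the bin_level 1 rows end up at the top of the plot.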
While you arrange the data by bin level before feeding it into ggplot, the plot's vertical arrangement follows the y-value (which is: sequence). You could create a combination of bin_level and sequence to arrange and plot the data by:
df %>%
...
## reformat bin_level to a three-digit character, so that
## 002 properly precedes 011 (otherwise 11 would come before 2)
mutate(dummy = paste(sprintf('%03.0f', bin_level),
Sequence, sep = '_')) %>%
arrange(dummy) %>%
...
## ggplot instructions:
ggplot() + ... +
geom_tile(aes(y = dummy, ...)) +
## remove the bin_level prefix ('00x_') for labelling; the anchored
## pattern leaves underscores inside the sequence name untouched:
scale_y_discrete(labels = function(x) sub('^[0-9]+_', '', x)) +
... +
theme(axis.text.y = element_text(
## note: df$bin_level NOT levels(df$bin_level)
colour = ifelse(df$bin_level == 1, "red", "black"))
)
Mind that using element_text() to colour labels this way might stop working in the future:
Vectorized input to element_text() is not officially supported.
Results may be unexpected or may change in future versions of ggplot2.
(console warning)
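A small worked example of the padded sort key, as a sketch with made-up bin_level values (the real data only has 0 and 1). It also shows why the pattern used to strip the prefix matters when sequence names contain underscores themselves; `key`, `anchored` and `greedy` are names introduced here.

```r
# Made-up levels to show the effect of zero-padding on string sorting.
bin_level <- c(2, 11, 0, 1)
padded <- sprintf("%03.0f", bin_level)

# Padded keys sort in numeric order; plain strings put "11" before "2".
sorted_padded <- sort(padded)
sorted_plain  <- sort(as.character(bin_level))

# A sort key built like the answer's dummy column.
key <- paste("001", "chr1_43580199_43586187", sep = "_")

# Anchored pattern: removes only the leading digits and first underscore.
anchored <- sub("^[0-9]+_", "", key)

# Greedy pattern: removes everything up to the LAST underscore.
greedy <- gsub(".*_", "", key)
```

Here `anchored` keeps the full sequence name, while `greedy` is left with only the final coordinate, so the anchored form is the safer choice for names like these.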

two different legends from one dataset

I am trying to have two legends: one based on variable c and the other on variable d, each defined by its own shape and size. I do not know if this is possible in ggplot2. Maybe it does not fit the philosophy behind ggplot2. If I transform the data to long format, I can deal with the different shapes, but the sizes are confounded. The same happens if I use the facet_wrap option.
e <- structure(list(a = c(5, 6, 7), b = c(5, 6, 7), c = c(0.1, 0.5,
1), d = c(10, 5, 1)), .Names = c("a", "b", "c", "d"), row.names = c(NA,
-3L), class = "data.frame")
library(ggplot2)
plot <- ggplot() + geom_point(data=e,aes(x=a,y=b,size=c), shape=1,
color="black")
plot <- plot + geom_point(data=e,aes(x=a,y=b,size=d), shape=3, color="red")
plot
Any advice is more than welcome.
You can map shape and size inside aes(), using one layer per variable, like geom_point(aes(x=a,y=b,shape=factor(c))) + geom_point(aes(x=a,y=b,size=d), shape=3). Each mapped aesthetic then gets its own legend. For example,
library(ggplot2)
ggplot(mpg) + geom_point(aes(x=hwy,y=cty,shape=class)) +
geom_point(aes(x=hwy,y=cty,size=cyl), shape=3)
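Applied to the data from the question, the same pattern might look like the sketch below. It assumes the question's data frame is stored as `e`; the fixed `size = 3` for the first layer and the legend titles are choices made here, not part of the original answer.

```r
library(ggplot2)

# The question's data, rebuilt as a plain data frame.
e <- data.frame(
  a = c(5, 6, 7),
  b = c(5, 6, 7),
  c = c(0.1, 0.5, 1),
  d = c(10, 5, 1)
)

# Layer 1 maps c to shape (as a factor); layer 2 maps d to size.
# Each mapped aesthetic produces its own legend.
p <- ggplot(e, aes(x = a, y = b)) +
  geom_point(aes(shape = factor(c)), colour = "black", size = 3) +
  geom_point(aes(size = d), shape = 3, colour = "red") +
  labs(shape = "c", size = "d")
```

Note that this turns c into a categorical variable. If both c and d have to be shown as continuous sizes, a single size scale is shared between the layers, which is the confounding described in the question.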

How to point each plot to correct y axis (many plots, two y axes, in R with ggplot2)

So I have compared two groups with a third using a range of inputs. For each of the three groups I have a value and a confidence interval for a range of inputs. For the two comparisons I also have a p-value for that range of inputs. Now I would like to plot all five data series, but use a second axis for the p values.
I am able to do that except for one thing: how do I make sure that R knows which of the plots to assign to the second axis?
This is what it looks like now. The bottom two data series should be scaled up to the Y axis to the right.
ggplot(df) +
geom_pointrange(aes(x=x, ymin=minc, ymax=maxc, y=meanc, color="c")) +
geom_pointrange(aes(x=x, ymin=minb, ymax=maxb, y=meanb, color="b")) +
geom_pointrange(aes(x=x, ymin=mina, ymax=maxa, y=meana, color="a")) +
geom_point(aes(x=x, y=c, color="c")) +
geom_point(aes(x=x, y=b, color="b")) +
scale_y_continuous(sec.axis = sec_axis(~.*0.2))
df is a dataframe whose column names are all the variables you see listed above, all row values are the corresponding datapoints.
You can get what you want, staying true to Hadley's canon and the Grammar of Graphics gospel, if you transform your data frame from wide to long and use a different aesthetic (e.g. shape, color, fill) for the means and for the CI.
You did not provide a reproducible example, so I employ my own. (Dput at the end of the post)
library(dplyr)
library(ggplot2)
# CatCI is an empty string where CI is missing, so only the categories
# that have a CI value get an entry in the fill legend.
df2 <- df %>%
  mutate(CatCI = if_else(is.na(CI), "", Cat))
ggplot(df2, aes(x = x)) +
  geom_pointrange(aes(ymin = min, ymax = max, y = mean, color = Cat), shape = 16) +
  geom_point(data = dplyr::filter(df2, !is.na(CI)),  # drop the rows where CI is NA
             aes(y = CI / 0.2,   # rescale CI so it lines up with the right-hand axis
                 fill = CatCI),  # a second aesthetic gives these points their own legend
             shape = 25, size = 5, alpha = 0.25) +   # shape, size and alpha changed for visibility
  scale_y_continuous(sec.axis = sec_axis(~ . * 0.2, name = "P Value")) +
  labs(color = "Linerange\nSinister Axis", fill = "P value\nDexter Axis", y = "Mean")
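The rescaling in the code above works because the two transformations are inverses: the points are divided by 0.2 before plotting, and the secondary axis multiplies the primary axis by 0.2 to label them. A minimal arithmetic check, with `scale_factor` and the three sample values chosen here for illustration:

```r
# Factor linking the two axes (0.2 in the answer's code).
scale_factor <- 0.2

# A few values in the range of the CI column.
p_values <- c(0.164, 0.178, 0.205)

# Position on the primary (left) axis where each point is drawn.
plotted <- p_values / scale_factor

# What the secondary (right) axis reads at those positions.
recovered <- plotted * scale_factor
```

Since `recovered` equals `p_values`, a point drawn at `plotted` on the left axis reads as its original value on the right axis. ggplot2 does not assign layers to axes; the secondary axis is only a relabelling of the primary one, so any layer meant for it has to be rescaled by hand like this.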
Result:
Dataframe:
df <- structure(list(Cat = c("a", "b", "c", "a", "b", "c", "a", "b",
"c", "a", "b", "c", "a", "b", "c"), x = c(2, 2, 2, 2.20689655172414,
2.20689655172414, 2.20689655172414, 2.41379310344828, 2.41379310344828,
2.41379310344828, 2.62068965517241, 2.62068965517241, 2.62068965517241,
2.82758620689655, 2.82758620689655, 2.82758620689655), mean = c(0.753611797661977,
0.772340941644911, 0.793970086962944, 0.822424652072316, 0.837015408776649,
0.861417383841253, 0.87023105762465, 0.892894201949377, 0.930096326498796,
0.960862178366363, 0.966600321596147, 0.991206984637544, 1.00714201832596,
1.02025006679944, 1.03650896186786), max = c(0.869753641121797,
0.928067675294351, 0.802815304215019, 0.884750162053761, 1.03609814491961,
0.955909854315582, 1.07113399603486, 1.02170928767791, 1.05504846273091,
1.09491706586801, 1.20235615364205, 1.12035782960649, 1.17387406039167,
1.13909154635088, 1.0581878034897), min = c(0.632638511783381,
0.713943701135991, 0.745868763626567, 0.797491261486603, 0.743382797144923,
0.827693203320894, 0.793417962991821, 0.796917421637021, 0.92942504556723,
0.89124101157585, 0.813058838839382, 0.91701749675892, 0.943744642652422,
0.912869230576973, 0.951734254896252), CI = c(NA, 0.164201137643034,
0.154868406784159, NA, 0.177948094206453, 0.178360305763648,
NA, 0.181862670931493, 0.198447350829814, NA, 0.201541499248143,
0.203737532636542, NA, 0.205196077692786, 0.200992205838595),
CatCI = c("", "b", "c", "", "b", "c", "", "b", "c", "", "b",
"c", "", "b", "c")), .Names = c("Cat", "x", "mean", "max",
"min", "CI", "CatCI"), row.names = c(NA, 15L), class = "data.frame")

ggplot scatter plot of two groups with superimposed means with X and Y error bars

How can I generate a ggplot2 scatterplot of two groups with the means indicated together with X and Y error bars, like this?
Here is a reduced example (using dput to recreate the data.frame df) with two groups of cells and three measures, and I'd like to say plot Peak against Rise, or Peak against Decay. That much is straightforward, but I would like to add points indicating the group means with X and Y error bars (+/- sem).
Is there a way to do this within ggplot2, or do I need to generate means and sem values first? This post drew my attention to geom_errorbarh but I'm still uncertain as to the best way to proceed.
library(ggplot2)
df<-structure(list(Group = c("A", "A", "A", "A", "A", "A", "A",
"A", "B", "B", "B", "B", "B", "B", "B", "B"), Peak = c(102.975,
37.805, 64.996, 66.36, 199.354, 7.425, 34.137, 366.59, 10.165,
14.833, 702.525, 39.086, 8.286, 122.783, 105.762, 37.018), Rise = c(0.346855,
0.24165, 0.24028, 0.461548, 0.194016, 0.164047, 0.484375, 0.307861,
0.438538, 0.488083, 0.549423, 0.365448, 0.511551, 0.33596, 0.331467,
0.270096), Decay = c(1.3874, 1.07407, 1.88787, 2.64408, 1.1462,
0.615963, 4.04641, 1.48701, 3.61397, 4.1838, 1.92746, 3.64329,
4.21354, 0.812695, 1.14611, 1.28279)), .Names = c("Group",
"Peak", "Rise", "Decay"), class = "data.frame", row.names = c(NA,
-16L))
ggplot(df, aes(Peak, Rise)) +
geom_point(aes(colour=Group)) +
theme_bw(14)
I have tried something like:
library(doBy)
sem <- function(x) sqrt(var(x)/length(x))
z<-summaryBy(Peak+Rise+Decay~Group, data=df, FUN=c(mean,sem))
z
to get the values, but easily (and flexibly) incorporating them into the ggplot code is defeating me.
I tend to use plyr for these kinds of summaries:
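library(plyr)
# plyr's summarise evaluates its arguments in order and lets later ones see
# earlier results, so the standard errors are computed before Peak and Rise
# are overwritten by their means. SE is sqrt(var / n), as in the question.
z <- ddply(df, .(Group), summarise,
           PeakSE = sqrt(var(Peak) / length(Peak)),
           RiseSE = sqrt(var(Rise) / length(Rise)),
           Peak = mean(Peak),
           Rise = mean(Rise))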
ggplot(df,aes(x = Peak,y = Rise)) +
geom_point(aes(colour = Group)) +
geom_point(data = z,aes(colour = Group)) +
geom_errorbarh(data = z,aes(xmin = Peak - PeakSE,xmax = Peak + PeakSE,y = Rise,colour = Group,height = 0.01)) +
geom_errorbar(data = z,aes(ymin = Rise - RiseSE,ymax = Rise + RiseSE,x = Peak,colour = Group))
I confess I was a little disappointed that I had to manually tweak the crossbar height. But thinking about it, I guess that could be fairly challenging to implement.
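For readers without plyr, a base R sketch of the same summary follows. It uses the `sem` function defined in the question; the `toy` data frame and its values are made up for illustration, and the column names `Peak.mean` and `Peak.sem` come from flattening the result of `aggregate()`.

```r
# Standard error of the mean, as defined in the question.
sem <- function(x) sqrt(var(x) / length(x))

# Made-up data: two groups of four measurements each.
toy <- data.frame(
  Group = rep(c("A", "B"), each = 4),
  Peak  = c(1, 2, 3, 4, 2, 4, 6, 8)
)

# aggregate() returns the two statistics as a matrix column...
z <- aggregate(Peak ~ Group, data = toy,
               FUN = function(v) c(mean = mean(v), sem = sem(v)))

# ...which do.call(data.frame, ...) flattens into ordinary columns.
z <- do.call(data.frame, z)
```

The resulting `z` has one row per group and can be passed as `data = z` to the error bar layers, with `Peak.mean - Peak.sem` and `Peak.mean + Peak.sem` as the limits.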
