Highlight Points in Scatterplot w/ ggplot2 - r

I need to create a qq plot of -log10 p-values in ggplot2 where a subset of 137 points ("targets") are highlighted in gold using a colorblind-friendly palette I'm using called cbbPalette. I cannot do this in an alternate package because I eventually need to combine multiple qq plots into a grid using grid.arrange from the gridExtra package that works with ggplot2.
Setup:
library(ggplot2)
library(reshape2)
cbbPalette <- c("#E69F00", "#000000") #part of my palette; gold & black
set.seed(100)
The data consists of 100,137 p-values, 137 of which are targets:
p_values = c(
runif(100000, min = 0, max = 1),
runif(132, min = 1e-7, max = 1),
c(6e-20, 6e-19, 7e-9, 7.5e-9, 4e-8)
)
#labels for the p-values
names_letters <-
do.call(paste0, replicate(2, sample(LETTERS, 100137, TRUE), FALSE))
names = paste0(names_letters, sprintf("%04d", sample(9999, 100137, TRUE)))
targets = names[100001:100137] #last 137 are targets
df = as.data.frame(p_values)
df$names = names
df <-
df[sample(nrow(df)), ] #shuffles the df to place targets randomly w/in
df$Category = ifelse(df$names %in% targets, "Target", "Non-Target")
Appearance of Data:
head(df, 4)
p_values names Category
89863 0.4821147 NZ3385 Non-Target
20209 0.3998835 SQ3793 Non-Target
29200 0.7893478 ZT5497 Non-Target
71623 0.3459360 QF5311 Non-Target
Melted df Using reshape2 with Observed (o) & Expected (e) -log10 p-values:
df.m = melt(df)
df.m$o = -log10(sort(df.m$value, decreasing = F))
df.m$e = -log10(1:nrow(df.m) / nrow(df.m))
Appearance of Melted df:
head(df.m,4)
names Category variable value o e
1 NZ3385 Non-Target p_values 0.4821147 19.221849 5.000595
2 SQ3793 Non-Target p_values 0.3998835 18.221849 4.699565
3 ZT5497 Non-Target p_values 0.7893478 8.154902 4.523473
4 QF5311 Non-Target p_values 0.3459360 8.124939 4.398535
QQ-plot
df_qq = ggplot(df.m, aes(e, o)) +
geom_point(aes(color = Category)) +
scale_colour_manual(values = cbbPalette) +
geom_abline(intercept = 0, slope = 1) +
ylab("Observed -log[10](p)") +
xlab("Theoretical -log[10](p)")
I then get a qq with no highlighting of my 137 targets.

You can draw the targets in a separate geom_point() call after the non-targets. Geoms are plotted in the order they are added, so the targets end up on top:
cbbPalette <- c(Target = "#E69F00", `Non-Target` = "#000000")
df_qq = ggplot(df.m, aes(e, o)) +
geom_abline(intercept = 0, slope = 1) +
geom_point(aes(color = Category), data = df.m[df.m$Category == "Non-Target", ]) +
geom_point(aes(color = Category), data = df.m[df.m$Category == "Target", ]) +
scale_colour_manual(values = cbbPalette) +
ylab("Observed -log[10](p)") +
xlab("Theoretical -log[10](p)")
I've also added names to your palette to make sure the right colours are attached to each category. Otherwise the colours can get mixed up when you change the order of the geom_point() calls.

If you want to avoid having to split your dataframe into two calls to geom_point, you can order the data by the Category column first, then pipe it into ggplot. For just these two category values, you could arrange pretty simply:
library(dplyr) # for %>% and arrange()
df.m %>%
arrange(Category) %>%
ggplot(...)
which will put your data in alphabetical order with Non-Target observations, then Target ones. Points get drawn in order, so this will put points in the target category on top.
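A minimal base-R sketch of the same reordering idea, on a small hypothetical stand-in for the question's data (the sizes and the names qq and pal are made up for illustration). It also sorts whole rows by p-value, so each Category label stays attached to its own observed value; the separate sort() on the value column in the question's melt step does not guarantee that.

```r
library(ggplot2)

set.seed(100)
# Hypothetical stand-in: 1000 null p-values plus 3 very small "target" p-values
df <- data.frame(
  p_values = c(runif(1000), 6e-20, 6e-19, 7e-9),
  Category = c(rep("Non-Target", 1000), rep("Target", 3))
)
df <- df[sample(nrow(df)), ]  # shuffle, as in the question

# Sort whole rows so Category travels with its p-value
qq <- df[order(df$p_values), ]
qq$o <- -log10(qq$p_values)
qq$e <- -log10(seq_len(nrow(qq)) / nrow(qq))

# Move targets to the end; rows are drawn in order, so they land on top
qq <- qq[order(qq$Category == "Target"), ]

pal <- c(Target = "#E69F00", `Non-Target` = "#000000")
p <- ggplot(qq, aes(e, o, colour = Category)) +
  geom_abline(intercept = 0, slope = 1) +
  geom_point() +
  scale_colour_manual(values = pal)
```

Printing p draws the plot; the object can also be handed to grid.arrange like any other ggplot.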
To have more control over the ordering, you can make Category a factor, and set the levels explicitly, then arrange by the factor order:
library(dplyr) # mutate(), arrange(), desc()
library(forcats) # fct_relevel()
df.m %>%
mutate(Category = as.factor(Category) %>% fct_relevel("Target")) %>%
arrange(desc(Category)) %>%
ggplot(...)
I'm using fct_relevel from the forcats package, just because it's a really easy way to manipulate factor levels; you could order levels with base R as well. fct_relevel puts the Target level first, so when I arrange by Category, I'm doing it in reverse, so that again Target gets drawn last.
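The base-R route mentioned above might look like the following sketch. The pts data frame is a made-up example; only factor() and order() are used.

```r
# Hypothetical toy data with the same Category labels as the question
pts <- data.frame(
  e = c(1, 2, 3, 4),
  o = c(1.1, 2.5, 2.9, 4.2),
  Category = c("Target", "Non-Target", "Target", "Non-Target")
)

# List the level that should be drawn last at the end
pts$Category <- factor(pts$Category, levels = c("Non-Target", "Target"))

# order() on a factor sorts by level position, so Target rows end up last
pts <- pts[order(pts$Category), ]
```

The reordered pts can then be passed to ggplot() as before.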
Hope that makes sense!


matching of shape, color and legend in bubble plot with subset of variable

I have some data
library(data.table)
wide <- data.table(id=c("A","C","B"), var1=c(1,6,1), var2=c(2,6,5), size1=c(11,12,13), size2=c(10,12,10), flag=c(FALSE,TRUE,FALSE))
> wide
id var1 var2 size1 size2 flag
1: A 1 2 11 10 FALSE
2: C 6 6 12 12 TRUE
3: B 1 5 13 10 FALSE
which I would like to plot as bubble plots where id is ordered by var2, and bubbles are as follows:
ID A and B: var1 is plotted in size1 and "empty bubbles" and var2 is plotted in size2 with "filled" bubbles.
ID C is flagged because there is only one value (this is why var1=var2) and it should have a "filled bubble" of a different color.
I have tried this as follows:
cols <- c("v1"="blue", "v2"="red", "flags"="green")
shapes <- c("v1"=16, "v2"=21, "flags"=16)
library(ggplot2)
p1 <- ggplot(data = wide, aes(x = reorder(id,var2))) + scale_size_continuous(range=c(5,15))
p1 <- p1 + geom_point(aes(size=size1, y = var1, color = "v1", shape = "v1"))
p1 <- p1 + geom_point(aes(size=size2, y = var2, color = "v2", shape = "v2", stroke=1.5))
p1 <- p1 + geom_point(data=subset(wide,flag), aes(size=size2[flag], y=var2[flag], color= "flags", shape="flags"))
p1 <- p1 + scale_color_manual(name = "test",
values = cols,
labels = c("v1", "v2", "flags"))
p1 <- p1 + scale_shape_manual(name = "test",
values = shapes,
labels = c("v1", "v2", "flags"))
which gives (in my theme)
but two questions remain:
What happened to the order in the legend? I have followed the recipe of the bottom solution in Two geom_points add a legend but somehow the order does not match.
How to get rid of the stroke around the green bubble and why is it there?
Overall, something appears to go wrong in matching shape and color.
I admit, it took me a while to understand your slightly convoluted plot. Forgive me, but I have allowed myself to change the way to plot, and make (better?) use of ggplot.
The data shape is less than ideal. ggplot works extremely well with long data.
It took a bit of guesswork to reshape your data, and I decided to go the quick and dirty way and simply bind the rows from selected columns.
Now you can see that you can achieve the new plot with a single call to geom_point. The rest is "scale_aesthetic" magic...
In order to combine the shape and color legend, safest is to use override.aes. But beware! It does not take named vectors, so the order of the values needs to be in the exact order given by your legend keys - which is usually alphabetic, if you don't have the factor levels defined.
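To see the key order that override.aes has to follow, it can help to look at how R sorts the labels. A small base-R sketch using the three labels from the question; as far as I know, ggplot2 orders character values in a discrete legend the same way.

```r
keys <- c("v1", "v2", "flags")

# Without explicit levels, factor() sorts the labels alphabetically
default_order <- levels(factor(keys))

# With explicit levels, the given order is kept
custom_order <- levels(factor(keys, levels = c("v1", "v2", "flags")))

print(default_order)
print(custom_order)
```

This is why "flags" moves to the front of the legend in the question. Whichever order ends up in the legend is the order the override.aes values have to follow.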
update re: request to order x labels
This hugely depends on the actual data structure. If it is originally as you have presented, I'd first make id a factor with the levels ordered based on your var2. Then, do the data shaping.
library(tidyverse)
# data reshape
wide <- data.frame(id=c("C","B","A"), var1=c(1,6,1), var2=c(2,6,5), size1=c(11,12,13), size2=c(10,12,10), flag=c(FALSE,TRUE,FALSE))
wide <- wide %>% mutate(id = reorder(id, var2))
wide1 <- wide %>% filter(!flag) %>% select(id, var = var1, size = size1)
wide2 <- wide %>% filter(!flag) %>% select(id, var = var2, size = size2)
# flagged ids have a single value (var1 == var2), so take it from var2
wide3 <- wide %>% filter(flag) %>% select(id, var = var2, size = size2)
long <- bind_rows(list(v1 = wide1, v2 = wide2, flag = wide3), .id = "var_id")
# rearrange the vectors for scales aesthetic
cols <- c(flag="green", v1 ="blue", v2="red" )
shapes <- c(flag=16, v1=16, v2 =21 )
ggplot(data = long, aes(x = id, y = var)) +
geom_point(aes(size=size, shape = var_id, color = var_id), stroke=1.5) +
scale_size_continuous(limits = c(5,15),breaks = seq(5,15,5)) +
scale_shape_manual(name = "test", values = shapes) +
scale_color_manual(values = cols, guide = "none") + # guide = FALSE is deprecated in newer ggplot2
guides(shape = guide_legend(override.aes = list(color = cols)))
P.S. the reason for the red stroke around the green bubble in your plot is that you also plotted the 'var2' behind your flag.
Created on 2020-04-08 by the reprex package (v0.3.0)

ggplot2: multiple variables on x-axis at multiple times

I have a data frame for observation numbers (3 observations for same id), height, weight and fev that looks like this (just for example):
id obs height weight fev
1 1 160 80 90
1 2 150 70 85
1 3 155 76 87
2 1 140 67 91
2 2 189 78 71
2 3 178 86 89
I need to plot this data using ggplot2 such that the x-axis shows the 3 variables height, weight and fev, and the observation numbers are displayed as 3 vertical lines for each variable (color coded). Each line should show the median as a solid circle and the 25th and 75th percentiles as caps at the upper and lower ends of the line (no minimum or maximum needed). I have tried many variations of box plots so far but I am not even getting close. Any suggestions on how to approach or solve this?
Thanks
OK instead what I did below was make three graphs then piece together with gridExtra. Read more about package here: http://www.sthda.com/english/wiki/wiki.php?id_contents=7930
I took the common legend code from this site to produce the following, starting with our existing longdf2. By piecing together the graphs, the information about corresponding observation is within the title of the graph
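library(reshape2) # melt()
library(ggplot2)
library(dplyr) # %>% and filter()
# median_hilow (used below) needs the Hmisc package to be installed
id <- rep(1:12, each = 3)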
obs <- rep(1:3, 12)
height <- seq(140,189, length.out = 36)
weight <- seq(67,86, length.out = 36)
fev <- seq(71,91, length.out = 36)
df <- as.data.frame(cbind(id,obs,height, weight, fev))
obsonly <- melt(df, id.vars = c('id'), measure.vars = 'obs')
obsonly <- rbind(obsonly,obsonly,obsonly)
newvars <- melt(df[-2],id.vars = 'id')
longdf2 <- cbind(obsonly,newvars)
longdf2 <- longdf2[-4] #dropping second id column
colnames(longdf2)[c(2:5)] <- c('obs', 'obsnum', 'variable', 'value')
#Make graph 1 of observation 1
g1 <- longdf2 %>%
dplyr::filter(obsnum == 1) %>%
ggplot(aes(x = variable, y = value, color = variable)) +
stat_summary(fun.data=median_hilow) +
labs(title = "Observation 1") +
theme(plot.title = element_text(hjust = 0.5)) #has a legend
g2 <- longdf2 %>%
dplyr::filter(obsnum == 2) %>%
ggplot(aes(x = variable, y = value, color = variable)) +
stat_summary(fun.data=median_hilow) +
labs(title = "Observation 2") +
theme(plot.title = element_text(hjust = 0.5), legend.position = 'none')
#specified as none to make common legend at end
g3 <- longdf2 %>%
dplyr::filter(obsnum == 3) %>%
ggplot(aes(x = variable, y = value, color = variable)) +
stat_summary(fun.data=median_hilow) +
labs(title = "Observation 3") +
theme(plot.title = element_text(hjust = 0.5), legend.position = 'none')
library(gridExtra)
get_legend<-function(myggplot){
tmp <- ggplot_gtable(ggplot_build(myggplot))
leg <- which(sapply(tmp$grobs, function(x) x$name) == "guide-box")
legend <- tmp$grobs[[leg]]
return(legend)
}
# Save legend
legend <- get_legend(g1)
# Remove legend from 1st graph
g1 <- g1 + theme(legend.position = 'none')
# Combine graphs
grid.arrange(g1, g2, g3, legend, ncol=4, widths=c(2.3, 2.3, 2.3, 0.8))
There are plenty of other little tweaks you could make along the way.
Try putting the data into long format prior to graphing. I generated some more data, 12 subjects, each with 3 observations.
id <- rep(1:12, each = 3)
obs <- rep(1:3, 12)
height <- seq(140,189, length.out = 36)
weight <- seq(67,86, length.out = 36)
fev <- seq(71,91, length.out = 36)
df <- as.data.frame(cbind(id,obs,height, weight, fev))
library(reshape2) #use to melt data from wide to long format
longdf <- melt(df,id.vars = c('id', 'obs'))
You don't need to define measure variables here: since the id.vars are defined, the remaining columns default to measure variables. If you have more variables in your data set, you'll want to define the measure variables in that same line as measure.vars = c("height", "weight", "fev"):
longdf <- melt(df,id.vars = c('id', 'obs'), measure.vars = c("height", "weight", "fev"))
Apologies, haven't earned enough votes to put figures into my responses
ggplot(data = longdf, aes(x = variable, y = value, fill = factor(obs))) +
geom_boxplot(notch = T, notchwidth = .25, width = .25, position = position_dodge(.5))
This does not produce the exact graph you described, which sounded like geom_linerange or something similar; those geoms require an x, ymin, and ymax to draw. Otherwise a regular ol' boxplot has the 1st and 3rd quartiles and the median marked. I adjusted the boxplot's notch and width parameters to make it thinner, and separated the boxes slightly with position_dodge(.5).
after reading your response, I edited my original answer
You could try facet_wrap, and watch the switch from "fill" to "color" in ggplot. A geom with no interior to fill (points and lines, unlike a boxplot or a density) has to be "colored" instead. Use color instead of fill in the original aes():
ggplot(data = longdf, aes(x = variable, y = value, color = factor(obs))) +
stat_summary(fun.data=median_hilow) + facet_wrap(.~obs)
This gives you observation 1 - height, weight, fev side by side, observation 2- height, ....
If that still isn't what you want perhaps more like height observation 1,2,3; weight observation 1,2,3...then you'll need to modify your melting to have two variable and two value columns. Essentially make two melted dataframes, then cbind. Annnnd because each observation has three variables, you'll need to rbind to make sure both data frames have the same number of rows:
obsonly <- melt(df, id.vars = c('id'), measure.vars = 'obs')
obsonly <- rbind(obsonly,obsonly,obsonly) #making rows equal
longvars <- melt(df[-2],id.vars = 'id') #dropping obs from melt
longdf2 <- cbind(obsonly,longvars)
longdf2 <- longdf2[-4] #dropping second id column
colnames(longdf2)[c(2:5)] <- c('obs', 'obsnum', 'variable', 'value')
ggplot(data = longdf2, aes(x = obsnum, y = value,
color = factor(variable))) +
stat_summary(fun.data=median_hilow) +
facet_wrap(.~variable)
From here you can play around with the x axis marks (probably isn't useful to have a 1.5 observation marked) and the spacing of the lines from each other
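Note that median_hilow defaults to conf.int = 0.95, i.e. the 2.5th and 97.5th percentiles; passing fun.args = list(conf.int = 0.5) to stat_summary() should give the quartiles instead. For the layout in the question (variables on the x-axis, one capped line per observation number), here is a sketch with a hypothetical helper, median_iqr, that avoids the Hmisc dependency. The data is the same made-up sequence as above, reshaped with base R.

```r
library(ggplot2)

# Hypothetical helper: median with 25th and 75th percentiles,
# in the y/ymin/ymax shape that stat_summary(fun.data = ...) expects
median_iqr <- function(x) {
  q <- unname(quantile(x, probs = c(0.25, 0.5, 0.75)))
  data.frame(y = q[2], ymin = q[1], ymax = q[3])
}

# Same made-up data as above: 12 ids, 3 observations each
df <- data.frame(
  id = rep(1:12, each = 3),
  obs = rep(1:3, 12),
  height = seq(140, 189, length.out = 36),
  weight = seq(67, 86, length.out = 36),
  fev = seq(71, 91, length.out = 36)
)

# Long format built by hand (roughly what melting on id and obs gives)
longdf <- data.frame(
  id = rep(df$id, times = 3),
  obs = factor(rep(df$obs, times = 3)),
  variable = rep(c("height", "weight", "fev"), each = nrow(df)),
  value = c(df$height, df$weight, df$fev)
)

dodge <- position_dodge(width = 0.5)
p <- ggplot(longdf, aes(x = variable, y = value, colour = obs)) +
  # capped line from the 25th to the 75th percentile
  stat_summary(fun.data = median_iqr, geom = "errorbar",
               width = 0.2, position = dodge) +
  # solid circle at the median
  stat_summary(fun.data = median_iqr, geom = "point",
               size = 3, position = dodge)
```

Both layers share the same position_dodge() so each circle sits on its own line; the dodge and cap widths are arbitrary choices.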

ggplot2 - Two color series in area chart

I've got a question regarding an edge case with ggplot2 in R.
They don't like you adding multiple legends, but I think this is a valid use case.
I've got a large economic dataset with the following variables.
year = year of observation
input_type = *labor* or *supply chain*
input_desc = specific type of labor (eg. plumbers OR building supplies respectively)
value = percentage of industry spending
And I'm building an area chart over approximately 15 years. There are 39 different input descriptions and so I'd like the user to see the two major components (internal employee spending OR outsourcing/supply spending)in two major color brackets (say green and blue), but ggplot won't let me group my colors in that way.
Here are a few things I tried.
Junk code to reproduce
spec_trend_pie<- data.frame("year"=c(2006,2006,2006,2006,2007,2007,2007,2007,2008,2008,2008,2008),
"input_type" = c("labor", "labor", "supply", "supply", "labor", "labor","supply","supply","labor","labor","supply","supply"),
"input_desc" = c("plumber" ,"manager", "pipe", "truck", "plumber" ,"manager", "pipe", "truck", "plumber" ,"manager", "pipe", "truck"),
"value" = c(1,2,3,4,4,3,2,1,1,2,3,4))
spec_broad <- ggplot(data = spec_trend_pie, aes(y = value, x = year, group = input_type, fill = input_desc)) + geom_area()
Which gave me
Error in f(...) : Aesthetics can not vary with a ribbon
And then I tried this
sff4 <- ggplot() +
geom_area(data=subset(spec_trend_pie, input_type="labor"), aes(y=value, x=variable, group=input_type, fill= input_desc)) +
geom_area(data=subset(spec_trend_pie, input_type="supply_chain"), aes(y=value, x=variable, group=input_type, fill= input_desc))
Which gave me this image...so closer...but not quite there.
To give you an idea of what is desired, here's an example of something I was able to do in GoogleSheets a long time ago.
It's a bit of a hack but forcats might help you out. I did a similar post earlier this week:
How to factor sub group by category?
First some base data
library(tidyverse) # tibble, dplyr, forcats, ggplot2
set.seed(123)
raw_data <-
tibble(
x = rep(1:20, each = 6),
rand = sample(1:120, 120) * (x/20),
group = rep(letters[1:6], times = 20),
cat = ifelse(group %in% letters[1:3], "group 1", "group 2")
) %>%
group_by(group) %>%
mutate(y = cumsum(rand)) %>%
ungroup()
Now, use factor levels to create gradients within colors
df <-
raw_data %>%
# create factors for group and category
mutate(
group = fct_reorder(group, y, max),
cat = fct_reorder(cat, y, max) # ordering in the stack
) %>%
arrange(cat, group) %>%
mutate(
group = fct_inorder(group), # takes the category into account first
group_fct = as.integer(group), # factor as integer
hue = as.integer(cat)*(360/n_distinct(cat)), # base hue values
light_base = 1-(group_fct)/(n_distinct(group)+2), # trust me
light = floor(light_base * 100) # new L value for hcl()
) %>%
mutate(hex = hcl(h = hue, l = light))
Create a lookup table for scale_fill_manual()
area_colors <-
df %>%
distinct(group, hex)
Lastly, make your plot
ggplot(df, aes(x, y, fill = group)) +
geom_area(position = "stack") +
scale_fill_manual(
values = area_colors$hex,
labels = area_colors$group
)
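The same idea (one hue per category, a lightness step per group within it) can be sketched directly on the spec_trend_pie sample data from the question using base R's hcl(). The hue values (120 and 240), the chroma, the lightness range, and the names key and pal are assumptions chosen for illustration.

```r
library(ggplot2)

spec_trend_pie <- data.frame(
  year = rep(2006:2008, each = 4),
  input_type = rep(c("labor", "labor", "supply", "supply"), times = 3),
  input_desc = rep(c("plumber", "manager", "pipe", "truck"), times = 3),
  value = c(1, 2, 3, 4, 4, 3, 2, 1, 1, 2, 3, 4)
)

# One row per description, ordered by type and then description
key <- unique(spec_trend_pie[c("input_type", "input_desc")])
key <- key[order(key$input_type, key$input_desc), ]

# Assumed base hues: a green for labor, a blue for supply
base_hue <- c(labor = 120, supply = 240)

# Position of each description within its type, and the size of that type
key$rank <- ave(seq_along(key$input_desc), key$input_type, FUN = seq_along)
key$n <- ave(key$rank, key$input_type, FUN = length)

# Same hue within a type; lightness runs from 40 to 80 across its descriptions
key$hex <- hcl(
  h = base_hue[as.character(key$input_type)],
  c = 60,
  l = 40 + 40 * (key$rank - 1) / pmax(key$n - 1, 1)
)
pal <- setNames(key$hex, key$input_desc)

# Keep descriptions of the same type adjacent in the stack and the legend
spec_trend_pie$input_desc <- factor(spec_trend_pie$input_desc,
                                    levels = key$input_desc)

p <- ggplot(spec_trend_pie, aes(x = year, y = value, fill = input_desc)) +
  geom_area() +
  scale_fill_manual(values = pal)
```

Because pal is a named vector, each colour is matched to its description by name rather than by position.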

Sorting factors in multipanel plot in ggplot2 according to the first panel

Is it possible to sort factors in a multipanel plot in ggplot2 according to the first panel? The first panel decides the order and the remaining panels follow that order.
Here is an example:
require(ggplot2)
set.seed(36)
xx<-data.frame(YEAR=rep(c("X","Y"), each=20),
CLONE=rep(c("A","B","C","D","E"), each=4, 2),
TREAT=rep(c("T1","T2","T3","C"), 10),
VALUE=sample(c(1:10), 40, replace=T))
ggplot(xx, aes(x=CLONE, y=VALUE, fill=YEAR)) +
geom_bar(stat="identity", position="dodge") +
facet_wrap(~TREAT)
Which gives me this plot:
Now I would like to sort CLONE based on the VALUE in YEAR X in a descending order (highest to lowest) but only for the Control (C panel). This order should then be maintained for T1, T2, and T3. By looking at the plot above, I want panel C sorted as CLONE C, B or D (both are 5), A and E. This order of CLONE should then be replicated for the remaining panels.
There's no easy way to do this right in ggplot since you have to reorder CLONE by 3 conditions (TREAT, YEAR and VALUE); otherwise forcats::fct_reorder2 could have been an option.
Instead, extract the order of CLONE from the subset of data corresponding to YEAR = "X", TREAT = "C", and re-define your factor levels for the whole data set based on this subset.
library("ggplot2")
library("dplyr")
set.seed(36)
xx <- data.frame(YEAR = rep(c("X","Y"), each = 20),
CLONE = rep(c("A","B","C","D","E"), each = 4, 2),
TREAT = rep(c("T1","T2","T3","C"), 10),
VALUE = sample(c(1:10), 40, replace = TRUE), stringsAsFactors = FALSE)
clone_order <- xx %>% subset(TREAT == "C" & YEAR == "X") %>%
arrange(-VALUE) %>% select(CLONE) %>% unlist()
xx <- xx %>% mutate(CLONE = factor(CLONE, levels = clone_order))
ggplot(xx, aes(x = CLONE, y = VALUE, fill = YEAR)) +
geom_bar(stat = "identity", position = "dodge") +
facet_wrap(~TREAT)
giving
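The level-extraction step can also be written in base R. A sketch on a small made-up data set (the values are hypothetical and chosen without ties so the resulting order is easy to check by eye):

```r
# Hypothetical data: 3 clones, 2 years, control (C) and one treatment (T1)
xx <- data.frame(
  YEAR  = rep(c("X", "Y"), each = 6),
  CLONE = rep(c("A", "B", "C"), times = 4),
  TREAT = rep(c("C", "T1"), each = 3, times = 2),
  VALUE = c(2, 9, 5, 1, 1, 1, 7, 3, 8, 4, 4, 4)
)

# Reference subset: control panel, year X
ctrl <- xx[xx$TREAT == "C" & xx$YEAR == "X", ]

# Clones sorted by descending VALUE within the reference subset
clone_order <- ctrl$CLONE[order(-ctrl$VALUE)]

# Apply that order to the whole data set; every panel then follows it
xx$CLONE <- factor(xx$CLONE, levels = clone_order)
```

In the reference subset the values are A = 2, B = 9, C = 5, so the levels come out as B, C, A.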

Custom scatterplot matrix using facet_grid in ggplot2

I'm trying to write a custom scatterplot matrix function in ggplot2 using facet_grid. My data have two categorical variables and one numeric variable.
I'd like to facet (make the scatterplot rows/cols) according to one of the categorical variables and change the plotting symbol according to the other categorical.
I do so by first constructing a larger dataset that includes all combinations (combs) of the categorical variable from which I'm creating the scatterplot panels.
My questions are:
How to use geom_rect to white-out the diagonal and upper panels in facet_grid (I can only make the middle ones black so far)?
How can you move the titles of the facets to the bottom and left hand sides respectively?
How does one remove tick axes and labels for the top left and bottom right facets?
Thanks in advance.
require(ggplot2)
# Data
nC <- 5
nM <- 4
dat <- data.frame(
Control = rep(LETTERS[1:nC], nM),
measure = rep(letters[1:nM], each = nC),
value = runif(nC*nM))
# Change factors to characters
dat <- within(dat, {
Control <- as.character(Control)
measure <- as.character(measure)
})
# Check, lapply(dat, class)
# Define scatterplot() function
scatterplotmatrix <- function(data,...){
controls <- with(data, unique(Control))
measures <- with(data, unique(measure))
combs <- expand.grid(1:length(controls), 1:length(measures), 1:length(measures))
# Add columns for values
combs$value1 = 1
combs$value2 = 0
for ( i in 1:NROW(combs)){
combs[i, "value1"] <- subset(data, subset = Control==controls[combs[i,1]] & measure == measures[combs[i,2]], select = value)
combs[i, "value2"] <- subset(data, subset = Control==controls[combs[i,1]] & measure == measures[combs[i,3]], select = value)
}
for ( i in 1:NROW(combs)){
combs[i,"Control"] <- controls[combs[i,1]]
combs[i,"Measure1"] <- measures[combs[i,2]]
combs[i,"Measure2"] <- measures[combs[i,3]]
}
# Final pairs plot
plt <- ggplot(combs, aes(x = value1, y = value2, shape = Control)) +
geom_point(size = 8, colour = "#F8766D") +
facet_grid(Measure2 ~ Measure1) +
ylab("") +
xlab("") +
scale_x_continuous(breaks = c(0,0.5,1), labels = c("0", "0.5", "1"), limits = c(-0.05, 1.05)) +
scale_y_continuous(breaks = c(0,0.5,1), labels = c("0", "0.5", "1"), limits = c(-0.05, 1.05)) +
geom_rect(data = subset(combs, subset = Measure1 == Measure2), colour='white', xmin = -Inf, xmax = Inf,ymin = -Inf,ymax = Inf)
return(plt)
}
# Call
plt1 <- scatterplotmatrix(dat)
plt1
I'm not aware of a way to move the panel strips (the labels) to the bottom or left. Also, it's not possible to format the individual panels separately (e.g., turn off the tick marks for just one facet). So if you really need these features, you will probably have to use something other than, or in addition to ggplot. You should really look into GGally, although I've never had much success with it.
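Recent ggplot2 releases do offer a way to move the strips: facet_grid() has a switch argument, and theme(strip.placement = "outside") puts the strips outside the axes. A minimal sketch on made-up data (toy, H and V are hypothetical names):

```r
library(ggplot2)

set.seed(1)
# Hypothetical 4 x 4 grid of panels with four points each
toy <- expand.grid(H = letters[1:4], V = letters[1:4], rep = 1:4)
toy$x <- runif(nrow(toy))
toy$y <- runif(nrow(toy))

p <- ggplot(toy, aes(x, y)) +
  geom_point() +
  # switch = "both" moves column strips to the bottom and row strips to the left
  facet_grid(V ~ H, switch = "both") +
  theme(strip.placement = "outside")
```

Formatting individual panels separately (for example, dropping tick marks in a single facet) is still not something this sketch addresses.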
As far as leaving some of the panels blank, here is a way.
nC <- 5; nM <- 4
set.seed(1) # for reproducible example
# stringsAsFactors=TRUE keeps Control and measure as factors (the default before R 4.0)
dat <- data.frame(Control = rep(LETTERS[1:nC], nM),
measure = rep(letters[1:nM], each = nC),
value = runif(nC*nM), stringsAsFactors = TRUE)
scatterplotmatrix <- function(data,...){
require(ggplot2)
require(data.table)
DT <- data.table(data,key="Control")
gg <- DT[DT,allow.cartesian=TRUE] # self-join: every pair of measures within each Control
setnames(gg,c("Control","H","x","V","y"))
# keep points only for the panels below the diagonal
lower <- gg[as.numeric(H)<as.numeric(V)]
fmt <- function(x) format(x,nsmall=1)
plt <- ggplot(gg, aes(x,y,shape = Control)) +
geom_point(data=lower,size=5, colour="#F8766D") +
facet_grid(V ~ H) +
ylab("") + xlab("") +
scale_x_continuous(breaks=c(0,0.5,1), labels=fmt, limits=c(-0.05, 1.05)) +
scale_y_continuous(breaks=c(0,0.5,1), labels=fmt, limits=c(-0.05, 1.05))
return(plt)
}
scatterplotmatrix(dat)
The main feature of this is the filter as.numeric(H)<as.numeric(V) applied to the data for geom_point(...). Older ggplot2 versions accepted this directly as subset=.(as.numeric(H)<as.numeric(V)) in the layer (with plyr loaded), but as far as I know that argument was removed in ggplot2 2.0.0, so the filtered data is passed through data= instead. Either way you only get points where the condition is met, i.e. in facets where as.numeric(H)<as.numeric(V), while the full data set in ggplot(...) keeps the other panels in the grid. This works because H and V are factors, and as.numeric(...) operating on a factor returns the integer level codes, not the labels.
The rest is just a more compact (and much faster) way of creating what you called combs.
