Adding lines to grouped boxplots - r

I have a dataset with 3 factors (Parent.organization, Hierarchy, variable) as well as a metric variable (value) and could use some help. Here is some sample data of the same style:
sampleData <- data.frame(id = 1:100,
Hierarchy = sample(c("Consultant", "Registrar", "Intern", "Resident"), 100, replace = TRUE),
Parent.organization = sample(c("Metropolitan", "Regional"), 100, replace = TRUE),
variable = sample(c("CXR", "AXR", "CTPA", "CTB"), 100, replace = TRUE),
value = rlnorm(20, log(10), log(2.5)))
summary(sampleData)
Using the following code I get the graph below
library(ggplot2)
library(scales)
p0 = ggplot(sampleData, aes(x = Hierarchy, y = value, fill = variable)) +
geom_boxplot()
plog = p0 + scale_y_log10(breaks = trans_breaks("log10", function(x) 10^x),
labels = trans_format("log10", math_format(10^.x))) +
theme_bw() +
facet_grid(.~Parent.organization, scales = "free", space = "free")
I have a set of values I want to mark for each scan variable (these are the same across all elements of the hierarchy and represent true values). Let's say they are 3, 5, 7, 5 for AXR, CTB, CTPA, CXR respectively. I want these overlaid on top of the boxplots, but I am unsure how to proceed.
I'm after something akin to (I've just filled the first two but the same pattern would apply across the board):
My knowledge of R is improving, but I'd say I'm still fairly inept. Any suggestions on how to improve my question are also very welcome.

First, you have to make a new data frame for the lines that has the same grouping and facetting variables as the original data frame. The true values should be repeated for every combination of those variables.
true.df<-data.frame(Hierarchy =rep(rep(c("Consultant", "Registrar", "Intern", "Resident"),each=4),times=2),
Parent.organization = rep(c("Metropolitan", "Regional"),each=16),
variable = rep(c("AXR", "CTB", "CTPA", "CXR"),times=8),
true.val=rep(c(3,5,7,5),times=8))
Then you can use geom_crossbar() to add the lines. Use true.val for y, ymin and ymax so that each crossbar collapses to a single line. position=position_dodge() ensures the lines are dodged like the boxes, and show.legend=FALSE keeps them out of the legend (older ggplot2 versions called this argument show_guide).
plog+geom_crossbar(data=true.df,aes(x = Hierarchy,y=true.val,ymin=true.val,
ymax=true.val,fill=variable),
show.legend=FALSE,position=position_dodge(),color="red")
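If typing out the nested rep() calls feels error-prone, the same data frame can probably be built from the lookup table of true values. This is only a sketch using base R; the names true_vals and combos are made up for illustration, and the values are the ones given in the question.

```r
# Lookup table: one true value per scan type (values from the question).
true_vals <- data.frame(variable = c("AXR", "CTB", "CTPA", "CXR"),
                        true.val = c(3, 5, 7, 5))

# Every combination of the grouping and facetting variables.
combos <- expand.grid(
  Hierarchy = c("Consultant", "Registrar", "Intern", "Resident"),
  Parent.organization = c("Metropolitan", "Regional"),
  variable = true_vals$variable,
  stringsAsFactors = FALSE
)

# Attach the true value to each combination.
true.df <- merge(combos, true_vals, by = "variable")
```

The result has 4 x 2 x 4 = 32 rows, one per hierarchy level, organization and scan type, and can be passed to geom_crossbar() in the same way.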

Related

How to specify unique geom assignments to facets?

Below I have simulated a dataset where an assignment was given to 5 groups of individuals on 5 different days (a new group with 200 new individuals each day). TrialStartDate denotes the date on which the assignment was given to each individual (ID), and TrialFinishDate denotes when each individual finished the assignment.
library(dplyr)   # for %>%, group_by() and count()
library(ggplot2)
set.seed(123)
data <-
data.frame(
TrialStartDate = rep(c(sample(seq(as.Date('2019/02/01'), as.Date('2019/02/15'), by="day"), 5)), each = 200),
TrialFinishDate = sample(seq(as.Date('2019/02/01'), as.Date('2019/02/15'), by = "day"), 1000,replace = T),
ID = seq(1,1000, 1)
)
I am interested in comparing how long individuals took to complete the trial depending on when they started the trial (i.e., assuming TrialStartDate has an effect on the length of time it takes to complete the trial).
To visualize this, I want to make a barplot showing counts of IDs on each TrialFinishDate where bars are colored by TrialStartDate (since each TrialStartDate acts as a grouping variable). The best I have come up with so far is by faceting like this:
data%>%
group_by(TrialStartDate, TrialFinishDate)%>%
count()%>%
ggplot(aes(x = TrialFinishDate, y = n, col = factor(TrialStartDate), fill = factor(TrialStartDate)))+
geom_bar(stat = "identity")+
facet_wrap(~TrialStartDate, ncol = 1)
However, I also want to add a vertical line to each facet showing when the TrialStartDate was for each group (preferably colored the same as the bars). When attempting to add vertical lines with geom_vline, it adds all the lines to each facet:
data%>%
group_by(TrialStartDate, TrialFinishDate)%>%
count()%>%
ggplot(aes(x = TrialFinishDate, y = n, col = factor(TrialStartDate), fill = factor(TrialStartDate)))+
geom_bar(stat = "identity")+
geom_vline(xintercept = unique(data$TrialStartDate))+
facet_wrap(~TrialStartDate, ncol = 1)
How can we make the vertical lines unique to the respective group in each facet?
You're specifying xintercept outside of aes, so the faceting is not respected.
This should do the trick:
data %>%
group_by(TrialStartDate, TrialFinishDate)%>%
count()%>%
ggplot(aes(x = TrialFinishDate, y = n, col = factor(TrialStartDate), fill = factor(TrialStartDate)))+
geom_bar(stat = "identity")+
geom_vline(aes(xintercept = TrialStartDate))+
facet_wrap(~TrialStartDate, ncol = 1)
Note geom_vline(aes(xintercept = TrialStartDate))
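To get the lines coloured like the bars, one option is to give geom_vline() its own small data frame with one row per facet and map colour there. As far as I know, geom_vline() does not inherit the plot-level aesthetics, so the colour has to be mapped explicitly in that layer. The sketch below uses a smaller simulated dataset; trials and facet_lines are illustrative names, not part of the question.

```r
library(ggplot2)

set.seed(123)
# Smaller stand-in for the question's data: 5 start dates, 20 individuals each.
starts <- as.Date("2019-02-01") + c(0, 3, 6, 9, 12)
trials <- data.frame(
  TrialStartDate = rep(starts, each = 20),
  TrialFinishDate = as.Date("2019-02-01") + sample(0:14, 100, replace = TRUE)
)

# One row per facet, holding that facet's start date.
facet_lines <- unique(trials["TrialStartDate"])

p <- ggplot(trials, aes(x = TrialFinishDate, fill = factor(TrialStartDate))) +
  geom_bar() +   # counts rows per finish date
  geom_vline(data = facet_lines,
             aes(xintercept = TrialStartDate, colour = factor(TrialStartDate))) +
  facet_wrap(~TrialStartDate, ncol = 1)
```

Because facet_lines contains the facetting variable, each line should be drawn only in its own panel; print(p) shows the result.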

R control jitter function - avoid overplotting / non-random jitter

My problem seems simple: I am using ggplot2 with geom_jitter() to plot a variable (take my picture as an example).
Jitter now adds some random noise to the variable (the variable is just called "1" in this example) to prevent overplotting. So I have now random noise in the y-direction and clearly what otherwise would be completely overplotted is now better visible.
But here is my question:
As you can see, there are still some points that overplot each other. In my example, this could easily be prevented if the offsets in the y-direction were not random noise but placed more strategically.
Can I somehow alter the geom_jitter() behavior or is there a similar function in ggplot2 that does exactly this?
Not really a minimal example, but also not too long:
library("imputeTS")
library("ggplot2")
data <- tsAirgap
# 2.1 Create required data
# Get all indices of the data that comes directly before and after an NA
na_indx_after <- which(is.na(data[1:(length(data) - 1)])) + 1
# starting from index 2 moves all indexes one in front, so no -1 needed for before
na_indx_before <- which(is.na(data[2:length(data)]))
# Get the actual values to the indices and put them in a data frame with a label
before <- data.frame(id = "1", type = "before", input = na_remove(data[na_indx_before]))
after <- data.frame(id = "1", type = "after", input = na_remove(data[na_indx_after]))
all <- data.frame(id = "1", type = "source", input = na_remove(data))
# Get n values for the plot labels
n_before <- length(before$input)
n_all <- length(all$input)
n_after <- length(after$input)
# 2.4 Create dataframe for ggplot2
# join the data together in one dataframe
df <- rbind(before, after, all)
# Create the plot
gg <- ggplot(data = df) +
geom_jitter(mapping = aes(x = id, y = input, color = type, alpha = type), width = 0.5 , height = 0.5)
gg <- gg + ggplot2::scale_color_manual(
values = c("before" = "skyblue1", "after" = "yellowgreen","source" = "gray66"),
)
gg <- gg + ggplot2::scale_alpha_manual(
values = c("before" = 1, "after" = 1,"source" = 0.3),
)
gg + ggplot2::theme_linedraw() + theme(aspect.ratio = 0.5) + ggplot2::coord_flip()
So many good suggestions... here is what Ben's suggestion would look like for my example:
I changed parts of my code to:
gg <- ggplot(data = df, aes(x = input, color = type, fill = type, alpha = type)) +
geom_dotplot(binwidth = 15)
This basically works as intended for me. ggbeeswarm, as suggested by Jon, also worked great for my purpose.
I thought of a hack I really like, using ggrepel. It's normally used for labels, but nothing prevents you from making the label into a point.
df <- data.frame(x = rnorm(200),
col = sample(LETTERS[1:3], 200, replace = TRUE),
y = 1)
ggplot(df, aes(x, y, label = "●", color = col)) + # using unicode black circle
ggrepel::geom_text_repel(segment.color = NA,
box.padding = 0.01, key_glyph = "point")
A downside of this method is that ggrepel can take a lot of time for a large number of points, and it will recalculate the positions differently each time you change the plot size. A faster alternative is ggbeeswarm::geom_quasirandom, which uses a deterministic process to define jitter that looks random.
ggplot(df, aes(x,y, color = col)) +
ggbeeswarm::geom_quasirandom(groupOnX = FALSE)
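For anyone curious what "strategically placed offsets" could look like without an extra package, here is a rough sketch of the idea: bin the values, then stack the points of each bin symmetrically around the baseline. This is my own simplification, not how ggbeeswarm or geom_dotplot work internally; bin_width and the 0.02 step are arbitrary assumptions.

```r
library(ggplot2)

set.seed(1)
pts <- data.frame(x = rnorm(200))

bin_width <- 0.1                       # assumed bin size, tune to the point size
pts$bin <- round(pts$x / bin_width)    # points sharing a bin would overplot

# Position of each point within its bin, and the size of that bin.
pts$rank <- ave(pts$x, pts$bin, FUN = seq_along)
pts$n <- ave(pts$x, pts$bin, FUN = length)

# Deterministic offsets, centred on the baseline y = 1.
pts$y <- 1 + (pts$rank - (pts$n + 1) / 2) * 0.02

p <- ggplot(pts, aes(x = x, y = y)) + geom_point()
```

Points in the same bin never share a y value, though points in neighbouring bins can still touch, so the dedicated packages are likely the better choice for real plots.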

How to create separate facets for different measurements with tidyverse?

I am a novice programmer looking to plot highly grouped variables. Specifically, I am trying to plot a variable that is grouped by 5 other variables. Below is some example data similar to what I am working with.
library(ggplot2)
library(tibble)
set.seed(42)
mydf <- tibble(
grp = rep(c('A', 'B'), length.out = 32, each = 16),
sex = rep(c('M', 'F'), length.out = 32, each = 8),
cond = rep(c('Wet', 'Dry'), length.out = 32, each = 4),
measure = rep(c('Temperature', 'Volume'), length.out = 32, each = 2),
kind = rep(c('Experimental', 'Control'), length.out = 32, each = 1),
value = rnorm(32) * 100,
)
ggplot(mydf, aes(x = grp, y = value, col = cond)) +
geom_point() +
facet_wrap(sex~measure + kind)
However, the output is quite messy. Would it be possible to create separate faceted plots for each measurement? What would be a proper way to graph this type of data?
Thank you
For ease of comparison, I would facet on no more than two variables. I would also use facet_grid() rather than facet_wrap() in such cases, as I think it's just easier to keep track of the different facet dimensions if they are on separate axes.
In your case, you want to distinguish measurements for 5 binary variables.
grp
sex
cond
measure
kind
With "grp" on the x-axis, "cond" distinguished by colour, and 2 of the remaining 3 ("sex" and "measure") on facets, we'll need to introduce another aesthetic parameter to distinguish the last variable ("kind").
In this case, since there aren't too many points to plot, I suggest shape.
ggplot(mydf, aes(x = grp, y = value,
color = cond,
shape = kind)) +
geom_point(size = 5, stroke = 2) +
facet_grid(sex~measure) +
scale_shape_manual(values = c("Control" = 4, "Experimental" = 16),
breaks = c("Experimental", "Control"))
Using a filled shape for Experimental and an unfilled one for Control makes the two kinds visually distinct. Other shape options are listed in the ggplot2 documentation.
Note that if there are many different values in your grouping variables (e.g. 5 categories along the x-axis, 6 different colours, 20 facet combinations, etc.), or many points within each facet, the plot will look very busy, and you may want to split into separate plots rather than keep everything together.
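If you do want one plot per measurement, as asked in the question, a possible approach is to split the data by measure and build the same plot for each piece. This is only a sketch; by_measure and plots are illustrative names, and the data is rebuilt here with base R so the block runs on its own.

```r
library(ggplot2)

set.seed(42)
# Same layout as the question's data, built with base R.
mydf <- data.frame(
  grp = rep(c("A", "B"), each = 16),
  sex = rep(rep(c("M", "F"), each = 8), times = 2),
  cond = rep(rep(c("Wet", "Dry"), each = 4), times = 4),
  measure = rep(rep(c("Temperature", "Volume"), each = 2), times = 8),
  kind = rep(c("Experimental", "Control"), times = 16),
  value = rnorm(32) * 100
)

# One data frame per measurement.
by_measure <- split(mydf, mydf$measure)

# Build the same plot for each measurement, titled with its name.
plots <- lapply(names(by_measure), function(m) {
  ggplot(by_measure[[m]], aes(x = grp, y = value, color = cond, shape = kind)) +
    geom_point(size = 3) +
    facet_grid(sex ~ .) +
    ggtitle(m)
})
names(plots) <- names(by_measure)
```

Each element can then be printed or saved separately, e.g. print(plots[["Volume"]]), which also lets each measurement have its own y-axis range.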

Adding dummy values on axis in ggplot2 to add asymmetric distance between ticks

How to add dummy values on x-axis in ggplot2
I have the values 0,2,4,6,12,14,18,22,26 in my data, and I have plotted them on the x-axis. Is there a way to add the remaining even numbers, for which there is no data in the table? This would create the appropriate gaps on the x-axis.
Afterwards, the x-axis should show 0,2,4,6,8,10,12,14,16,18,20,22,24,26.
I have already tried using rbind.fill to add dummy data, but when I convert the values to a factor, the added values (8, 10, 16, etc.) come last.
Thanks
Hope this makes sense:
library(ggplot2)
gvals <- factor(letters[1:3])
xvals <- factor(c(0,2,4,6,12,14,18,22,26), levels = seq(0, 26, by = 2))
yvals <- rnorm(10000, mean = 2)
df <- data.frame(x = sample(xvals, size = length(yvals), replace = TRUE),
y = yvals,
group = sample(gvals, size = length(yvals), replace = TRUE))
ggplot(df, aes(x = x, y = y)) + geom_boxplot(aes(fill = group)) +
scale_x_discrete(drop = FALSE)
The trick is to create the x-variable as a factor with all the levels you need, and to specify drop = FALSE in the scale so that unused levels are kept on the axis.
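To see why this works, it may help to look at the factor on its own. A small base R sketch (observed is just an illustrative name for the values from the question):

```r
# Values that actually occur in the data.
observed <- c(0, 2, 4, 6, 12, 14, 18, 22, 26)

# Declare every even number from 0 to 26 as a level, in numeric order.
x <- factor(observed, levels = seq(0, 26, by = 2))

levels(x)   # all 14 levels, including the ones with no data
table(x)    # unused levels such as "8" and "10" get a count of 0
```

Because the levels are given explicitly, their order no longer depends on the order in which the values appear, which is probably why the rbind.fill attempt put the dummy values last.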

ggplot2 R, Fixing much values in axis (Line-plot)

I can't read my y-axis since it has a lot of values. I tried rotating the labels, but that doesn't work the way I want, and it isn't something I want to do anyway.
I want to specify the values in the axis, to be from say 20 to 30, maybe with step 0.1.
But there are 1000 values, so I guess the range suggested above doesn't work (?).
Ex:
runNumbers <- seq(from = 1, to = 1000)
tempVector <- seq(from = 20.0010, to = 30, by = 0.01)
plotData <- data.frame(RunNumber = runNumbers, temp = tempVector)
myUglyPlot <- ggplot(data = plotData, mapping = aes(x = RunNumber, y = temp, group = 1)) + geom_line()
#
#http://stackoverflow.com/questions/14428887/overflowing-x-axis-ggplot2?noredirect=1&lq=1
require(scales) # for removing scientific notation
# manually generate breaks/labels
labels <- seq(from = 0, to = 30, length.out = 1000)
# and set breaks and labels
myUglyPlot <- myUglyPlot + scale_y_discrete(breaks = labels, labels = as.character(labels))
# And now my graph is without labels, why?
Is there another way to do this, without rotating my labels? Or am I doing something wrong in the code from the other question (I tried to follow what he did...)?
Later I will have 10 000 values instead, so I really need to change this, I want to have a readable axis, that I can put the interval in.
Maybe I'm missing some simple concept. I tried searching and reading the R Graphics Cookbook, but without success so far.
Thanks for your time.
Update
I'm trying to use breaks, thanks for the help guys. Here's what I'm doing now (only this):
myUglyPlot <- ggplot(data = plotData, mapping = aes(x = RunNo, y = t_amb, group = 1)) + geom_line()
myUglyPlot <- myUglyPlot + scale_y_discrete(breaks=seq(from = 1, to = 50, by = 0.01))
But it doesn't give me any breaks. See pic.
You are almost there.. Since your y-axis is a continuous value, you need to use scale_y_continuous instead of scale_y_discrete.
myUglyPlot <- myUglyPlot + scale_y_continuous(breaks = labels)
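Note that 1000 breaks will still be hard to read. A coarser set of breaks is probably what you want; the sketch below uses one break per degree, which is an arbitrary choice, and yBreaks is just an illustrative name.

```r
library(ggplot2)

# Same shape of data as in the question: 1000 runs, temperatures from 20 to 30.
plotData <- data.frame(RunNumber = 1:1000,
                       temp = seq(from = 20.001, by = 0.01, length.out = 1000))

# One break per degree; adjust `by` to taste (e.g. 0.5).
yBreaks <- seq(from = 20, to = 30, by = 1)

p <- ggplot(plotData, aes(x = RunNumber, y = temp)) +
  geom_line() +
  scale_y_continuous(breaks = yBreaks, limits = c(20, 30))
```

The number of breaks is independent of the number of data points, so the same scale should stay readable with 10 000 values.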

Resources