plot factor frequency by group (a yield plot) - r

I have a data frame containing the test_outcome (PASS/FAIL) for each test_type performed on each test_subject. For example:
test_subject, test_type, test_outcome
person_a, height, PASS
person_b, height, PASS
person_c, height, FAIL
person_d, height, PASS
person_a, weight, FAIL
person_b, weight, FAIL
person_c, weight, PASS
person_d, weight, PASS
I would like to prepare a yield plot by test_type and test_subject.
Y-axis = yield i.e. num pass/(num pass + num fail)
X-axis = test_subject
Fill (colour) = one line for each test_type.
I would prefer to use ggplot2, can you please recommend the best approach here? e.g. how to reshape the data before plotting?

A quick dplyr answer; you will want to tidy up the graph (colours, etc.) to taste.
library(dplyr)
library(ggplot2)
dat <- dat %>%
  group_by(test_subject, test_type) %>%
  summarise(passrate = sum(test_outcome == "PASS") / n())
ggplot(dat, aes(x = test_subject, y = passrate, fill = test_type)) +
  geom_bar(stat = "identity", position = "dodge")
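Edit: a line graph was requested. Normally, categorical groups shouldn't be connected by lines, as there is no reason to order them in a particular way. Because the x-axis is discrete, group = test_type is needed so that geom_line draws one line per test type: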
ggplot(dat, aes(x = test_subject, y = passrate, col = test_type)) +
  geom_line(aes(group = test_type)) +
  geom_point()
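For reference, here is a self-contained sketch of the reshaping step, assuming the example rows from the question are typed in by hand as dat. The name yield is hypothetical, and mean() of a logical vector is used as an equivalent way to get num pass / (num pass + num fail):

```r
library(dplyr)

# Example data from the question, re-entered by hand (an assumption about its layout)
dat <- data.frame(
  test_subject = rep(c("person_a", "person_b", "person_c", "person_d"), times = 2),
  test_type    = rep(c("height", "weight"), each = 4),
  test_outcome = c("PASS", "PASS", "FAIL", "PASS", "FAIL", "FAIL", "PASS", "PASS")
)

# The mean of a logical vector is the share of TRUE values, i.e. the pass rate
yield <- dat %>%
  group_by(test_subject, test_type) %>%
  summarise(passrate = mean(test_outcome == "PASS"), .groups = "drop")

yield
```

With a single outcome per subject and test type, as in this toy data, every pass rate is either 0 or 1; the same code gives fractional yields once tests are repeated.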

Related

Grouped bar plot using ggplot2 [duplicate]

I want to create a side by side barplot using geom_bar() of this data frame,
> dfp1
    value   percent1   percent
1 (18,29] 0.20909091 0.4545455
2 (29,40] 0.23478261 0.5431034
3 (40,51] 0.15492958 0.3661972
4 (51,62] 0.10119048 0.1726190
5 (62,95] 0.05660377 0.1194969
With values on the x-axis and the percents as the side by side barplots. I have tried using this code,
p = ggplot(dfp1, aes(x = value, y= c(percent, percent1)), xlab="Age Group")
p = p + geom_bar(stat="identity", width=.5)
However, I get this error: Error: Aesthetics must either be length one, or the same length as the data Problems:value. My percent and percent1 are the same length as value, so I am confused. Thanks for the help.
You will need to melt your data first over value. melt creates another column called value by default, so you will need to rename it (I called it percent). Then plot the new data set using fill to separate the data into groups, and position = "dodge" to put the bars side by side (instead of on top of each other).
library(reshape2)
library(ggplot2)
dfp1 <- melt(dfp1)
names(dfp1)[3] <- "percent"
ggplot(dfp1, aes(x = value, y = percent, fill = variable)) +
  geom_bar(stat = "identity", width = .5, position = "dodge") +
  xlab("Age Group")  # an xlab argument passed to ggplot() itself is ignored
Similar to David's answer, here is a tidyverse option using tidyr::pivot_longer to reshape the data before plotting:
library(tidyverse)
dfp1 %>%
  pivot_longer(-value, names_to = "variable", values_to = "percent") %>%
  ggplot(aes(x = value, y = percent, fill = variable)) +
  geom_bar(stat = "identity", position = "dodge", width = 0.5) +
  xlab("Age Group")  # an xlab argument passed to ggplot() itself is ignored
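To see what the reshaping does before any plotting, here is a minimal sketch that rebuilds dfp1 by hand (storing value as plain character strings is an assumption; in the question it is probably a factor produced by cut()) and inspects the long result. The name dfp1_long is hypothetical:

```r
library(tidyr)

# dfp1 re-entered from the printout in the question
dfp1 <- data.frame(
  value    = c("(18,29]", "(29,40]", "(40,51]", "(51,62]", "(62,95]"),
  percent1 = c(0.20909091, 0.23478261, 0.15492958, 0.10119048, 0.05660377),
  percent  = c(0.4545455, 0.5431034, 0.3661972, 0.1726190, 0.1194969)
)

# One row per (age group, series) pair: 5 groups x 2 series = 10 rows
dfp1_long <- pivot_longer(dfp1, -value, names_to = "variable", values_to = "percent")

dfp1_long
```

Each original row becomes two rows, one per series, which is the shape ggplot2 needs in order to map the series name to fill.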

Convert a geom_tile in dotplot in ggplot2

I am making several heatmaps in ggplot2 using geom_tile. They work great, but what if I want dots instead of tiles (little rectangles)? My input is a binary matrix (converted to a table using the melt function).
My x and y are discrete factors. How do I produce circles or dots instead of tiles? Any ideas?
Thanks!
example:
dat <- data.frame(
  sample = c("a","a","a","b","b","b","c","c","c"),
  cond   = c("x","y","z","x","y","z","x","y","z"),
  value  = c("1","4","6","2","3","7","4","6","7"),
  score  = c(0,1,1,0,0,0,1,1,1)
)
if I use the following plot:
ggplot(dat, aes(x = sample, y = cond, color = value)) +
geom_point()
I get the wrong plot. Instead, I would like a dot only where score is 1 (and no dot where it is 0), with the dots colored by the value factor.
I assume you mean to map score to your color aesthetic and not value, as written in your shared code.
Simply convert score to a factor in your initial aesthetics call:
ggplot(dat, aes(x = sample, y = cond, color = as.factor(score))) +
geom_point()
EDIT:
The user indicated that they would like to keep only the observations where score equals 1, and then color the points by value. You can do so by adding the following pipe operation:
library(dplyr)
dat %>%
  filter(score == 1) %>%
  ggplot(aes(x = sample, y = cond, color = as.factor(value))) +
  geom_point()
Note that only 3 levels of value remain and that level b of sample is now missing from the x-axis. If sample is a factor, you can keep all of its levels by specifying drop = FALSE in scale_x_discrete():
dat %>%
  filter(score == 1) %>%
  ggplot(aes(x = sample, y = cond, color = as.factor(value))) +
  geom_point() +
  scale_x_discrete(drop = FALSE)
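One caveat: since R 4.0, data.frame() no longer converts strings to factors by default, so sample may be a plain character column, in which case there are no unused levels for drop = FALSE to keep. A minimal sketch, assuming that situation, converts sample to a factor before filtering (the names kept and p are hypothetical):

```r
library(dplyr)
library(ggplot2)

dat <- data.frame(
  sample = c("a","a","a","b","b","b","c","c","c"),
  cond   = c("x","y","z","x","y","z","x","y","z"),
  value  = c("1","4","6","2","3","7","4","6","7"),
  score  = c(0,1,1,0,0,0,1,1,1)
)

# Make sample a factor so that level "b" survives the filter as an unused level
dat$sample <- factor(dat$sample)

kept <- dat %>% filter(score == 1)

p <- ggplot(kept, aes(x = sample, y = cond, color = as.factor(value))) +
  geom_point() +
  scale_x_discrete(drop = FALSE)  # keep the empty "b" position on the x-axis
```

After filtering, no rows with sample b remain, but the factor still carries the level, which is what drop = FALSE relies on.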

Showing number of values outside axis range in boxplot (using ggplot2 in R)

Sometimes you want to limit the axis range of a plot to a region of interest so that certain features (e.g. location of the median & quartiles) are emphasized. Nevertheless, it may be of interest to make it clear how many/what proportion of values lie outside the (truncated) axis range.
I am trying to show this when using ggplot2 in R and am wondering whether there is some built-in way of doing this in ggplot2 (or alternatively some sensible solution some of you may have used). I am not particularly wedded to any one way of displaying this (e.g. jittered points with a different symbol at the edge of the plot, a little bar outside the plot whose fill level shows the proportion outside the range, or some other display that conveys the information).
Below is some example code that creates some mock data and the kind of plot I have in mind (shown below the code), but without any clear indication exactly how much data is outside the y-axis range.
library(ggplot2)
set.seed(seed=123)
group <- rep(c(0,1),each=500)
y <- rcauchy(1000, group, 10)
mockdata <- data.frame(group,y)
ggplot(mockdata, aes(factor(group), y)) +
  geom_boxplot(aes(fill = factor(group))) +
  coord_cartesian(ylim = c(-40, 40))
You may compute these values in advance and display them via e.g. geom_text:
library(dplyr)
upper_lim <- 40
lower_lim <- -40
mockdata$upper_cut <- mockdata$y > upper_lim
mockdata$lower_cut <- mockdata$y < lower_lim
mockdata$group <- as.factor(mockdata$group)
mockpts <- mockdata %>%
  group_by(group) %>%
  summarise(upper_count = sum(upper_cut),
            lower_count = sum(lower_cut))
ggplot(mockdata, aes(group, y)) +
  geom_boxplot(aes(fill = group)) +
  coord_cartesian(ylim = c(lower_lim, upper_lim)) +
  geom_text(y = lower_lim, data = mockpts,
            aes(label = lower_count, x = group), hjust = 1.5) +
  geom_text(y = upper_lim, data = mockpts,
            aes(label = upper_count, x = group), hjust = 1.5)
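The question also mentions proportions. As a variation on the counts above, this sketch computes the share of values beyond each limit per group and formats it as a label; the names mockprops, upper_label and lower_label are hypothetical:

```r
library(dplyr)

# Same mock data as in the question
set.seed(123)
group <- rep(c(0, 1), each = 500)
mockdata <- data.frame(group = factor(group), y = rcauchy(1000, group, 10))

upper_lim <- 40
lower_lim <- -40

# The mean of a logical vector is the proportion of TRUE values
mockprops <- mockdata %>%
  group_by(group) %>%
  summarise(
    upper_prop = mean(y > upper_lim),
    lower_prop = mean(y < lower_lim),
    .groups = "drop"
  ) %>%
  mutate(
    upper_label = sprintf("%.1f%% above", 100 * upper_prop),
    lower_label = sprintf("%.1f%% below", 100 * lower_prop)
  )

mockprops
```

mockprops can then be passed to geom_text in the same way as mockpts, with label = upper_label or label = lower_label.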

ggplot2 label points just one layer

I am attempting to make boxplots of some complex data. I have sorted the classes by one particular field (not the class field) and would now like to be able to label each box with the value of that sort-by field. I know from the way the data is structured that the value of this sort-by attribute will be the same for every observation within the class, and I would like to essentially annotate the chart with this additional piece of information.
I thought of trying to accomplish this by adding a point layer to the plot and then labeling those points. I attempted to do this using code like this example, which I mocked up using the mtcars data set for reproducibility. For the sake of this example, pretend that the variable gear would be the same for each distinct value of cyl. The "gear/1000000" part is just to get the labels all near the axis.
mtcars %>% group_by(cyl) %>%
ggplot(aes(x = reorder(cyl, gear), y = mpg)) +
geom_point(show.legend = FALSE, aes(x = reorder(cyl, gear), y = gear/1000000)) +
geom_text(aes(label = gear)) +
geom_boxplot(aes(colour=carb),varwidth = TRUE)
I feel like this is close, but this code is putting the labels on the boxplots instead of on the points, which is the opposite of what I'm looking for. How can I ask ggplot to label only the points from geom_point()? Or is there an easier way to accomplish my objective?
EDIT:
Here is what my plot now looks like, thanks to the answer provided below.
[Image: Boxplots of IRI distribution for various pavement segments]
Set a separate x and y aes for geom_text. In your code, you are plotting a label for each x,y in aes(x = reorder(cyl, gear), y = mpg) as that is the aes set in the parent ggplot. Instead, set a fixed y (offset by a given amount from your geom_point y value), and x (corresponding to the x value from your geom_point) inside geom_text:
For example (note: in mtcars there is more than one gear value per cyl, unlike the data you describe, so several labels overlap at each position):
mtcars %>% group_by(cyl) %>%
  ggplot(aes(x = reorder(cyl, gear), y = mpg)) +
  geom_point(show.legend = FALSE, aes(x = reorder(cyl, gear), y = gear/1000000)) +
  geom_boxplot(aes(colour = carb), varwidth = TRUE) +
  geom_text(aes(label = gear, x = reorder(cyl, gear), y = gear/1000000 - 2))
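As a possible simpler route, the label can be computed once per class and given to geom_text as its own data, so that exactly one label is drawn per box and no point layer is needed. This is a sketch under assumptions: factor(cyl) replaces the reorder() call for brevity, the names labels and gear_label are hypothetical, and because gear is not actually constant within cyl in mtcars, the distinct values are pasted together:

```r
library(dplyr)
library(ggplot2)

# One row per class, holding the text to show under each box
labels <- mtcars %>%
  group_by(cyl) %>%
  summarise(gear_label = paste(sort(unique(gear)), collapse = "/"), .groups = "drop")

# Place the labels just below the lowest observation
label_y <- min(mtcars$mpg) - 1

p <- ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_boxplot(varwidth = TRUE) +
  geom_text(data = labels, aes(x = factor(cyl), y = label_y, label = gear_label))
```

With real data where the sort-by field is constant within a class, the summarise step reduces to taking the first value per class.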

Visualizing the difference between two points with ggplot2

I want to visualize the difference between two points with a line/bar in ggplot2.
Suppose we have some data on income and spending as a time series.
We would like to visualize not only them, but the balance (=income - spending) as well.
Furthermore, we would like to indicate whether the balance was positive (=surplus) or negative (=deficit).
I have tried several approaches, but none of them produced a satisfying result. Here we go with a reproducible example.
# Load libraries and create LONG data example data.frame
library(dplyr)
library(ggplot2)
library(tidyr)
df <- data.frame(year = rep(2000:2009, times=3),
var = rep(c("income","spending","balance"), each=10),
value = c(0:9, 9:0, rep(c("deficit","surplus"), each=5)))
df
1.Approach with LONG data
Unsurprisingly, it doesn't work with LONG data,
because the geom_linerange arguments ymin and ymax cannot be specified correctly. ymin=value, ymax=value is definitely the wrong way to go (expected behaviour). ymin=income, ymax=spending is obviously wrong, too (expected behaviour).
df %>%
ggplot() +
geom_point(aes(x=year, y=value, colour=var)) +
geom_linerange(aes(x=year, ymin=value, ymax=value, colour=net))
#>Error in function_list[[i]](value) : could not find function "spread"
2.Approach with WIDE data
I almost got it working with WIDE data.
The plot looks good, but the legend for the geom_point(s) is missing (expected behaviour).
Simply adding show.legend = TRUE to the two geom_point(s) doesn't solve the problem as it overprints the geom_linerange legend. Besides, I would rather have the geom_point lines of code combined in one (see 1.Approach).
df %>%
spread(var, value) %>%
ggplot() +
geom_linerange(aes(x=year, ymin=spending, ymax=income, colour=balance)) +
geom_point(aes(x=year, y=spending), colour="red", size=3) +
geom_point(aes(x=year, y=income), colour="green", size=3) +
ggtitle("income (green) - spending (red) = balance")
3.Approach using LONG and WIDE data
Combining the 1.Approach with the 2.Approach results in yet another unsatisfying plot. The legend does not differentiate between balance and var (=expected behaviour).
ggplot() +
geom_point(data=(df %>% filter(var=="income" | var=="spending")),
aes(x=year, y=value, colour=var)) +
geom_linerange(data=(df %>% spread(var, value)),
aes(x=year, ymin=spending, ymax=income, colour=balance))
Any (elegant) way out of this dilemma?
Should I use some other geom instead of geom_linerange?
Is my data in the right format?
Try
ggplot(df[df$var != "balance", ]) +
  geom_point(
    aes(x = year, y = value, fill = var),
    size = 3, pch = 21, colour = alpha("white", 0)) +
  geom_linerange(
    aes(x = year, ymin = income, ymax = spending, colour = balance),
    data = spread(df, var, value)) +
  scale_fill_manual(values = c("green", "red"))
The main idea is that we use two different types of aesthetics for colours (fill for the points, with the appropriate pch, and colour for the lines) so that we get separate legends for each.
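Note that in the question's df the value column mixes numbers with the strings "deficit" and "surplus", so c() coerces everything to character. A sketch of the same two-legend idea with numeric columns, assuming the data can be kept in wide form and reshaped with tidyr::pivot_longer (the names wide, long and p are hypothetical):

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Wide data with numeric income/spending; balance is derived rather than typed in
wide <- data.frame(year = 2000:2009, income = 0:9, spending = 9:0) %>%
  mutate(balance = ifelse(income >= spending, "surplus", "deficit"))

# Long copy of the two numeric series for the point layer
long <- wide %>%
  pivot_longer(c(income, spending), names_to = "var", values_to = "value")

# fill drives the point legend, colour drives the line legend
p <- ggplot() +
  geom_linerange(data = wide,
                 aes(x = year, ymin = spending, ymax = income, colour = balance)) +
  geom_point(data = long, aes(x = year, y = value, fill = var),
             shape = 21, size = 3, colour = "transparent") +
  scale_fill_manual(values = c(income = "green", spending = "red"))
```

Naming the entries in values ties each colour to its series explicitly instead of relying on the alphabetical order of the levels.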
