Adding ID to outliers in ggplot barplot in R

I have created a stacked barplot
ggplot(data %>% count(x, y),
aes(x, n, fill = factor(y))) +
geom_bar(stat="identity")+
theme_light()+
theme(plot.title = element_text(hjust=0.5))
There are (possible) outliers at 50, 54 and 60. How can I add their IDs to the graph?

If you post your data, I'll amend this answer using it. But basically you want
df %>%
count(x, y) %>%
ggplot(aes(x = x, y = n, fill = y)) +
geom_col() +
geom_text(aes(label = x), data = . %>% filter(x >= thresh), vjust = 0, nudge_y = 0.1)
where thresh is some threshold you've set--maybe an arbitrary cutoff point that makes sense, or maybe 3 standard deviations from the mean of x, or whatever. You can store it in an outside variable, you can make a boolean column in your dataframe, or you can calculate it inline inside your geom_text--really up to you. vjust = 0, nudge_y = 0.1 puts the labels just above the bars corresponding to your outliers.
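As a minimal sketch of the threshold idea, assuming made-up data: the df below, its columns x and y, and the three-standard-deviation rule are all illustrative choices, not the asker's data. The labels are computed once per stack here, so each outlying bar gets a single ID at its top.

```r
library(dplyr)
library(ggplot2)

# Hypothetical data: a bulk of values around 30 plus three large values
df <- data.frame(
  x = c(rep(28:32, times = c(10, 20, 40, 20, 10)), 50, 54, 60),
  y = rep(c("a", "b"), length.out = 103)
)

# One possible threshold: three standard deviations above the mean of x
thresh <- mean(df$x) + 3 * sd(df$x)

counts <- df %>% count(x, y)

# One row per x (total height of the stack), kept only for the outliers
labels <- counts %>%
  group_by(x) %>%
  summarise(n = sum(n)) %>%
  filter(x >= thresh)

p <- ggplot(counts, aes(x = x, y = n)) +
  geom_col(aes(fill = y)) +
  geom_text(aes(label = x), data = labels, vjust = 0, nudge_y = 0.1)

p
```

Keeping fill inside geom_col() means the label layer does not need a y column of its own.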

Maybe geom_text(data = mydata %>% filter(just.the.outliers))?
See also this: RE: Alignment of numbers on the individual bars with ggplot2

Related

How to level a ggplot2 histogram with two classes, with independent levels for each class?

Suppose I have this data:
xy <- data.frame(x = c(1,2,3,4,5,2,3,4), y = c(rep('A',5),rep('B',3)))
So, when I type
ggplot(xy, aes(x = x, fill = y)) +
geom_histogram(aes(y=..count../sum(..count..)), position = "dodge")
I get this graphic:
But I wanted to see the levels independently leveled, i.e., the red bars leveled to 0.2 and the blue bars leveled to 0.333. How can I achieve it?
Also, how can I set the y-axis to show the numbers in percentage instead of decimals?
Many thanks in advance.
This seems to do the job. It uses ..density.. rather than ..count.., a rather ugly way to count the number of levels in the A/B factor column, and then the scales package to get the labels on the y axis
ggplot(xy, aes(x = x, fill = y)) +
geom_histogram(aes(y=..density../sum(..density..)*length(unique(xy$y)), group = y), position = "dodge") +
scale_y_continuous(labels = scales::percent_format(accuracy = 1))
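A related sketch, not taken from the answer above: for data like this, where every x is a whole number, geom_bar() can compute the within-class proportion directly through the prop variable of stat_count. It assumes ggplot2 3.3.0 or later for after_stat(), and xy is rebuilt with named columns so that the block runs on its own.

```r
library(ggplot2)

xy <- data.frame(x = c(1, 2, 3, 4, 5, 2, 3, 4),
                 y = c(rep("A", 5), rep("B", 3)))

# group = y makes stat_count compute prop within each class
p <- ggplot(xy, aes(x = x, fill = y)) +
  geom_bar(aes(y = after_stat(prop), group = y), position = "dodge") +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1))

p
```

This only suits discrete x values; for genuinely continuous data the histogram approach above still applies.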
As an alternative to calculating everything inside ggplot, you can first calculate the relative frequency and then plot that value with geom_col. preserve = "single" keeps the bars the same width:
library(ggplot2)
library(dplyr)
xy <- data.frame(x = c(1,2,3,4,5,2,3,4),
y = c(rep('A',5),rep('B',3)))
xy <- xy %>%
group_by(y, x) %>%
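summarise(rel_freq = n()) %>% # count per class and x value; result is still grouped by y
mutate(rel_freq = rel_freq / sum(rel_freq)) # divide by the total count of the class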
ggplot(xy, aes(x = x, y = rel_freq, fill = y)) +
geom_col(position = position_dodge2(preserve = "single")) +
scale_y_continuous(labels = scales::percent_format(accuracy = 1))

Readjusting the horizontal axis in ggplot

I have a simple dataset, containing values from 0 to 1. When I plot it, naturally, the horizontal axis is zero. I would like this reference to be 0.5 and the bars falling below 0.5 to be reversed and colored differently than those falling above this threshold.
my.df <- data.frame(group=state.name[1:20],col1 = runif(20))
p <- ggplot(my.df, aes(x=group,y=col1)) +
geom_bar(stat="identity")+ylim(0,0.5)
I am thinking of splitting the data into two subsets, one with values greater than 0.5 and the other with values smaller than 0.5, and then somehow combining these two subsets in the same ggplot. Is there a cleaner way to do that? Thanks!
To build on @jas_hughes's answer, you can subtract 0.5 from your col1 variable, then rename the labels on the y-axis.
library(dplyr)
library(ggplot2)
df <- data.frame(group=state.name[1:20],value=runif(20))
df %>% ggplot(aes(reorder(group,value),value-0.5)) + geom_bar(stat='identity') +
# the shifted values are continuous, so relabel a continuous scale
scale_y_continuous(name='Value',
breaks=c(-0.5,0,0.5),
labels=c('0','0.5','1'),
limits=c(-0.5,0.5)) +
xlab('State') +
theme(axis.text.x = element_text(angle=45,hjust=1))
The y-variable you are trying to communicate is distance from 0.5, so you need to change the values in col1 to reflect this.
library(dplyr)
library(ggplot2)
my.df %>%
mutate(col2 = col1-0.5) %>%
ggplot() +
aes(x = group, y = col2, fill = col2 >=0) +
geom_bar(stat = 'identity') +
theme(legend.position = 'none',
axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
ylab('Col1 above 0.5 (AU)')
Note, you can also use the aes(fill = col1 >= 0.5) option to color code the bars without shifting the axis (which is what I would recommend if col1 contains percentages).
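A minimal sketch of that last option, assuming the same kind of data as in the question (the seed and the dashed reference line are my additions): the bars keep their original heights and only the fill changes at 0.5.

```r
library(ggplot2)

set.seed(1)  # arbitrary seed so the example is reproducible
my.df <- data.frame(group = state.name[1:20], col1 = runif(20))

p <- ggplot(my.df, aes(x = group, y = col1, fill = col1 >= 0.5)) +
  geom_col() +
  geom_hline(yintercept = 0.5, linetype = "dashed") +  # mark the reference value
  theme(legend.position = "none",
        axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))

p
```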

how to plot geom_col() to get y axis to center around 1 and not 0

I am plotting some data using ggplot2's geom_col. The data represents a ratio that should center around 1 and not 0. This would allow me to highlight which categories go below or above this central ratio. I've tried playing with scale_y_continuous() and ylim(), neither of which allows me to set a central axis value.
Basically: how do I make Y center around 1 and not 0?
Sorry if I am asking a question that's been answered... maybe I just don't know the right keywords?
ggplot(data = plotdata) +
geom_col(aes(x = stressclass, y= meanexpress, color = stressclass, fill = stressclass)) +
labs(x = "Stress Response Category", y = "Average Response Normalized to Control") +
facet_grid(exposure_cond ~ .)
As of now my plots look like this:
You can pre-process your y-values so that the plot actually starts at 0, then change the scale labels to reflect the original values (demonstrating with a built-in dataset):
library(dplyr)
library(ggplot2)
cut.off = 500 # (= 1 in your use case)
diamonds %>%
filter(clarity %in% c("SI1", "VS2")) %>%
count(cut, clarity) %>%
mutate(n = n - cut.off) %>% # subtract cut.off from y values
ggplot(aes(x = cut, y = n, fill = cut)) +
geom_col() +
geom_text(aes(label = n + cut.off, # label original values (optional)
vjust = ifelse(n > 0, 0, 1))) +
geom_hline(yintercept = 0) +
scale_y_continuous(labels = function(x) x + cut.off) + # add cut.off to label values
facet_grid(clarity ~ .)
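The same idea applied to a ratio centred on 1 might look like the sketch below. The plotdata values are invented stand-ins (only the column names come from the question), and centre is a hypothetical variable name.

```r
library(dplyr)
library(ggplot2)

# Hypothetical stand-in for plotdata; all values are invented
plotdata <- data.frame(
  stressclass = c("heat", "cold", "salt", "drought"),
  meanexpress = c(1.4, 0.7, 1.1, 0.9)
)

centre <- 1  # the ratio that should sit on the axis

p <- plotdata %>%
  mutate(shifted = meanexpress - centre) %>%  # bars now start at the centre
  ggplot(aes(x = stressclass, y = shifted, fill = stressclass)) +
  geom_col() +
  geom_hline(yintercept = 0) +
  scale_y_continuous(labels = function(x) x + centre) +  # relabel with the original ratios
  labs(x = "Stress Response Category", y = "Average Response Normalized to Control")

p
```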

Adding value labels at the end of the plot

I'm trying to plot the labels in the plot. I did that, but they don't really look very good. Here's an example to better understand it. I have this plot:
And I want to add the values at the end. I did that, but they look strange:
Is there any way I could fix this? I do the plotting at the end of the code, where it is commented #plotting.
Here is the reproducible code:
library(glmnet)
library(dplyr)
library(tidyr)
library(ggplot2)
library(ggrepel) # provides geom_text_repel()
set.seed(100)
n=100
p=50
X=matrix(rnorm(n*p), nrow=n)
y=matrix(rnorm(n), nrow=n)
lam = seq(0.1,7,length.out=100)
lm=glmnet(X,y,alpha=1,lambda=lam, intercept=FALSE, standardize=FALSE)
value1=as.matrix(coef(lm))
#creating a dataframe
L1 <- function(x)
sum(abs(x))
bind_cols(
as.data.frame(value1) %>%
summarise_all(L1) %>%
t() %>%
as.data.frame() %>%
rename(x = V1),
t(value1) %>%
as.data.frame() %>%
rename_all(~ gsub("V", "", .))
) %>%
gather(row, y, 2:(nrow(value1) + 1)) -> dataf
#plotting
ggplot(dataf, aes(x, y, colour = row)) + geom_line() +
geom_text_repel(
data = subset(dataf, x == max(x)),
aes(label = row),
size = 2,
nudge_x = 1
) +
theme(legend.position = "none")
The main thing going on here is that you have a bunch of text labels all in one spot, so by repelling them and letting them have segments attaching labels to their values, you end up with this starburst at the end of your plot.
To see what I mean, filter your data for the maximum x value, which is where you're placing your labels, and for rows where y == 0: there are 35 of these! So you have 35 bits of text all vying for the same spot and being repelled away from one another.
dataf %>%
filter(x == max(x), y == 0) %>%
nrow()
#> [1] 35
A second way to see this is to set the color of the segments connecting the texts to their values. If you set it to gray, you can distinguish those segments from the actual geom_lines, since they are no longer the same color.
ggplot(dataf, aes(x, y, colour = row)) +
geom_line() +
geom_text_repel(
data = . %>% filter(x == max(x)),
aes(label = row),
size = 2,
nudge_x = 0.01,
segment.color = "gray60"
) +
scale_x_continuous(expand = expand_scale(mult = c(0.05, 0.1))) +
theme(legend.position = "none")
Here are a couple of ways you can avoid this tangle: I decreased the nudge_x so the texts would be closer to the lines (nudge_x works in relation to your x values, so nudging over by 1 when values only range from 0 to about 0.6 puts the labels very far away). I changed the segment color to something neutral, and adjusted the minimum distance before the segments are drawn. I added an expand_scale to give some more space on the right side (this is only in the dev version of ggplot still). And most importantly, I took out labels for values of 0.
You should probably tweak these things to your liking, but hopefully this is a start in cleaning it up.
ggplot(dataf, aes(x, y, colour = row)) +
geom_line() +
geom_text_repel(
data = . %>% filter(x == max(x), y != 0),
aes(label = row),
size = 2,
nudge_x = 0.01,
min.segment.length = 5,
segment.color = "gray60"
) +
scale_x_continuous(expand = expand_scale(mult = c(0.05, 0.1))) +
theme(legend.position = "none")
Created on 2018-06-11 by the reprex package (v0.2.0).

Lines connecting jittered points - dodging by multiple groups

I am trying to connect jittered points between measurements from two different methods (measure) on the x-axis. These measurements are linked to one another by the probands (a), who can be separated into two main groups, patients (pat) and controls (ctr).
My df looks like this:
set.seed(1)
df <- data.frame(a = rep(paste0("id", "_", 1:20), each = 2),
value = sample(1:10, 40, rep = TRUE),
measure = rep(c("a", "b"), 20), group = rep(c("pat", "ctr"), each = 2,10))
I tried
library(ggplot2)
ggplot(df,aes(measure, value, fill = group)) +
geom_point(position = position_jitterdodge(jitter.width = 0.1, jitter.height = 0.1,
dodge.width = 0.75), shape = 1) +
geom_line(aes(group = a), position = position_dodge(0.75))
Created on 2020-01-13 by the reprex package (v0.3.0)
I used the fill aesthetic in order to separate the jittered dots from both groups (pat and ctr). I realised that when I put the group = a aesthetics into the ggplot main call, then it doesn't separate as nicely, but seems to link better to the points.
My question: Is there a way to better connect the lines to the (jittered) points, but keeping the separation of the two main groups, ctr and pat?
Thanks a lot.
The big issue you are having is that you are dodging the points by only group but the lines are being dodged by a, as well.
To keep your lines with the axes as is, one option is to manually dodge your data. This takes advantage of factors being integers under the hood, moving one level of group to the right and the other to the left.
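# factor() makes the conversion explicit; in R >= 4.0 measure is a character column
df = transform(df, dmeasure = ifelse(group == "ctr",
as.numeric(factor(measure)) - .25,
as.numeric(factor(measure)) + .25 ) )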
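To see the "integers under the hood" point in isolation, here is a small sketch on four made-up rows. It wraps measure in factor() explicitly, which is an assumption on my part to cover R 4.0 and later, where data.frame() keeps strings as character rather than converting them to factors.

```r
measure <- c("a", "b", "a", "b")
group   <- c("pat", "pat", "ctr", "ctr")

# factor levels are stored as integer codes: "a" -> 1, "b" -> 2
codes <- as.numeric(factor(measure))

# shift controls to the left and patients to the right of each code
dmeasure <- ifelse(group == "ctr", codes - 0.25, codes + 0.25)
dmeasure
# 1.25 2.25 0.75 1.75
```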
You can then make a plot with measure as the x axis but then use the "dodged" variable as the x axis variable in geom_point and geom_line.
ggplot(df, aes(x = measure, y = value) ) +
geom_blank() +
geom_point( aes(x = dmeasure), shape = 1 ) +
geom_line( aes(group = a, x = dmeasure) )
If you also want jittering, that can be added manually to both your x and y variables.
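# factor() makes the conversion explicit; in R >= 4.0 measure is a character column
df = transform(df, dmeasure = ifelse(group == "ctr",
jitter(as.numeric(factor(measure)) - .25, .1),
jitter(as.numeric(factor(measure)) + .25, .1) ),
jvalue = jitter(value, amount = .1) )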
ggplot(df, aes(x = measure, y = jvalue) ) +
geom_blank() +
geom_point( aes(x = dmeasure), shape = 1 ) +
geom_line( aes(group = a, x = dmeasure) )
This turned out to be an astonishingly common question and I'd like to add an answer/comment to myself with a suggestion of a - what I now think - much, much better visualisation:
The scatter plot.
I originally intended to show paired data and visually guide the eye between the two comparisons. The problem with this visualisation is evident: Every subject is visualised twice. This leads to a quite crowded graphic. Also, the two dimensions of the data (measurement before, and after) are forced into one dimension (y), and the connection by ID is awkwardly forced onto your x axis.
Plot 1: The scatter plot naturally represents the ID by showing only one point per subject, while showing both dimensions more naturally on x and y. The only step needed is to make your data wider (yes, this is sometimes necessary; ggplot does not always require long data).
The box plot
Plot 2: As rightly pointed out by user AllanCameron, another option would be to plot the difference of the paired values directly, for example as a boxplot. This is a nice visualisation of the appropriate paired t-test where the mean of the differences is tested against 0. It will require the same data shaping to "wide format". I personally like to show the actual values as well (if there are not too many).
library(tidyr)
library(dplyr)
library(ggplot2)
## first reshape the data wider (one column for each measurement)
df %>%
pivot_wider(names_from = "measure", values_from = "value", names_prefix = "time_" ) %>%
## now use the new columns for your scatter plot
ggplot() +
geom_point(aes(time_a, time_b, color = group)) +
## you can add a line of equality to make it even more intuitive
geom_abline(intercept = 0, slope = 1, lty = 2, linewidth = .2) +
coord_equal()
Box plot to show differences of paired values
df %>%
pivot_wider(names_from = "measure", values_from = "value", names_prefix = "time_" ) %>%
ggplot(aes(x = "", y = time_a - time_b)) +
geom_boxplot() +
# optional, if you want to show the actual values
geom_point(position = position_jitter(width = .1))
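The link between the difference plot and the paired t-test can be checked numerically. The sketch below rebuilds the question's df and compares a paired test with a one-sample test on the differences; wide, paired and one_sample are names I introduced for illustration.

```r
library(tidyr)

set.seed(1)
df <- data.frame(a = rep(paste0("id", "_", 1:20), each = 2),
                 value = sample(1:10, 40, replace = TRUE),
                 measure = rep(c("a", "b"), 20),
                 group = rep(c("pat", "ctr"), each = 2, times = 10))

wide <- pivot_wider(df, names_from = "measure", values_from = "value",
                    names_prefix = "time_")

# paired t-test on the two measurements ...
paired <- t.test(wide$time_a, wide$time_b, paired = TRUE)
# ... and a one-sample t-test of the differences against 0
one_sample <- t.test(wide$time_a - wide$time_b, mu = 0)

# the two p-values should agree
c(paired = paired$p.value, one_sample = one_sample$p.value)
```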

Resources