Barplot with sorted dots - r

I want to plot a 4 group barplot from a first data-frame called df1 and display dots from another data-frame called df2. The idea is to check how many dots from df2 lie outside of df1.
So I made the following graph which works well.
### 0- Import packages
library(dplyr)
library(ggplot2)
### 1- Data simulation
set.seed(4)
df1 <- data.frame(var=c("a", "b", "c", "d"), value=c(15, 19, 18, 17))
df2 <- data.frame(var1=rep(c("a", "b", "c", "d"), each=20), value=rnorm(80, 15, 2), color=NA, fill=NA)
### 2- Coloring data (outside=red, inside=blue)
df2$fill <- case_when(
(df2$var1=="a" & df2$value>subset(df1, var=='a')$value) ~ "#e18b8b",
(df2$var1=="b" & df2$value>subset(df1, var=='b')$value) ~ "#e18b8b",
(df2$var1=="c" & df2$value>subset(df1, var=='c')$value) ~ "#e18b8b",
(df2$var1=="d" & df2$value>subset(df1, var=='d')$value) ~ "#e18b8b",
TRUE ~ "#8cbee2")
df2$color <- case_when(
(df2$var1=="a" & df2$value>subset(df1, var=='a')$value) ~ "#ca0d0d",
(df2$var1=="b" & df2$value>subset(df1, var=='b')$value) ~ "#ca0d0d",
(df2$var1=="c" & df2$value>subset(df1, var=='c')$value) ~ "#ca0d0d",
(df2$var1=="d" & df2$value>subset(df1, var=='d')$value) ~ "#ca0d0d",
TRUE ~ "#0c78ca")
### 3- Display plot
ggplot(aes(x=var, y=value), data=df1) + geom_bar(stat="identity", fill='#8cbee2', width=0.6) +
geom_point(data=df2, aes(x=var1, y=value), colour=df2$color, fill=df2$fill, position=position_jitter(width=0.05, height=0), shape=21, size=2)
To improve this graph, I would like to sort the dots from df2 by value within each bar group, so that they form a qqplot-like shape.
-First, this would show whether the number of dots outside each bar is large or not.
-Second, this would show the distribution of the inside and outside dots.
I have found the following link but it only deals with one data-frame and I am working with 2.
How to plot boxplots superimposed with sorted points using ggplot2
Do you have any clue on how to sort these dots?
EDIT
Result following Stephan's answer

One option to achieve your desired result is to use position_dodge with a helper column. First order your data by var1 and value, then add the helper column as the interaction of var1 and the row number. Mapping this helper column to the group aes ensures that the points are plotted in ascending order, and the dodge then gives the qqplot-like shape:
Note: I also used a different approach for the colors which uses a left_join and maps on the color and fill aes.
library(dplyr)
set.seed(4)
df1 <- data.frame(var = c("a", "b", "c", "d"), value = c(15, 19, 18, 17))
df2 <- data.frame(var1 = rep(c("a", "b", "c", "d"), each = 20), value = rnorm(80, 15, 2), color = NA, fill = NA)
df2 <- df2 %>%
left_join(df1, by = c("var1" = "var"), suffix = c("", "_df1")) %>%
arrange(var1, value) %>%
mutate(
var_dodge = interaction(var1, row_number()),
color = value > value_df1
)
library(ggplot2)
ggplot(aes(x = var, y = value), data = df1) +
geom_bar(stat = "identity", fill = "#8cbee2", width = 0.6) +
geom_point(
data = df2, aes(x = var1, y = value, group = var_dodge, color = color, fill = color),
position = position_dodge(width = .4), shape = 21, size = 2
) +
scale_color_manual(values = c("TRUE" = "#ca0d0d", "FALSE" = "#0c78ca")) +
scale_fill_manual(values = c("TRUE" = "#e18b8b", "FALSE" = "#8cbee2")) +
guides(fill = "none", color = "none")
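Since the stated goal is to check how many dots from df2 lie outside the bars of df1, the comparison can also be summarised numerically. The following is a minimal base R sketch, assuming the same simulated data as above; the names bar_height, outside and outside_summary are made up for this illustration.

```r
# Same simulated data as in the question
set.seed(4)
df1 <- data.frame(var = c("a", "b", "c", "d"), value = c(15, 19, 18, 17))
df2 <- data.frame(var1 = rep(c("a", "b", "c", "d"), each = 20),
                  value = rnorm(80, 15, 2))

# Look up the bar height that belongs to each dot's group
bar_height <- df1$value[match(df2$var1, df1$var)]

# A dot counts as "outside" when it lies above its bar
df2$outside <- df2$value > bar_height

# Count the outside dots per group (hypothetical helper table)
outside_summary <- aggregate(outside ~ var1, data = df2, FUN = sum)

# Share of outside dots, 20 dots per group in this simulation
outside_summary$share <- outside_summary$outside / 20
outside_summary
```

The per-group counts could then be printed next to the plot or added as text labels above each bar.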

Related

R / Tidyverse: Ordering factors within group with duplicate labels and plotting using facet_wrap

I am trying to plot a graph with ggplot where I facet on one variable and make two plots (one for each level of that variable), each showing a value (x) for a category (y), where y is plotted in descending order with respect to x within each group. The issue is that the same y label exists in more than one group, and this seems to mess up the factor levels.
Setting up the data:
library(tidyverse)
dummy_data <- tibble(
y_var = c('v1', 'v2', 'v3', 'v4', 'v5', 'v2', 'v6', 'v7', 'v4', 'v7'),
x_var = c(0.0629, 0.0631, 0.0654, 0.0657, 0.0676, 0.0693, 0.0707, 0.0728, 0.0733, 0.0868),
group_var = c("A", "B", "B", "A", "B", "A", "A", "A", "B", "B")
) %>%
group_by(group_var) %>% # group by the grouping variable
arrange(x_var, .by_group = TRUE) %>% # arrange the rows by x_var, within each group
mutate(y_var = factor(y_var, levels = y_var)) # change the y variable to a factor to retain order when plotting
dummy_data %>%
ggplot() +
geom_point(aes(x = x_var, y = y_var, color = group_var), size = 5) +
facet_wrap(~group_var, scales = 'free', dir = 'v')
The output, shown below, is not ordered as desired.
This can be rectified by making all y_var values unique (adding group name to the value), however this is not desirable because this changes y axis labels (shown below)
dummy_data %>%
mutate(y_var = str_c(y_var, group_var),
y_var = factor(y_var, levels = y_var)) %>%
ggplot() +
geom_point(aes(x = x_var, y = y_var, color = group_var), size = 5) +
facet_wrap(~group_var, scales = 'free', dir = 'v')
I would like to produce the second graph, without having to change the value of all y_var. Thanks!
tidytext::reorder_within() does something similar, and in combination with tidytext::scale_y_reordered() helps with tidying the output to look like your goal.
library(tidytext)
dummy_data %>%
mutate(y_var = reorder_within(y_var, x_var, group_var)) %>%
ggplot() +
geom_point(aes(x = x_var, y = y_var, color = group_var), size = 5) +
scale_y_reordered() +
facet_wrap(~group_var, scales = 'free', dir = 'v')
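To see why this works, the idea behind the two tidytext helpers can be sketched in base R: make each label unique by appending the group, order those unique keys by x, and strip the suffix again when the axis labels are drawn. This is only an illustration of the mechanism, not the tidytext source; the "___" separator, the y_key column and the strip_group function are assumptions made for this sketch.

```r
# Same toy data as in the question, as a plain data frame
dummy <- data.frame(
  y_var = c("v1", "v2", "v3", "v4", "v5", "v2", "v6", "v7", "v4", "v7"),
  x_var = c(0.0629, 0.0631, 0.0654, 0.0657, 0.0676,
            0.0693, 0.0707, 0.0728, 0.0733, 0.0868),
  group_var = c("A", "B", "B", "A", "B", "A", "A", "A", "B", "B")
)

# Build a key that is unique per (label, group) combination
key <- paste(dummy$y_var, dummy$group_var, sep = "___")

# Order the levels of the key by x_var
dummy$y_key <- reorder(key, dummy$x_var)

# Labelling function that removes the group suffix again
strip_group <- function(x) sub("___.+$", "", x)

levels(dummy$y_key)
strip_group(levels(dummy$y_key))
```

In a plot, y_key would be mapped to y and strip_group passed to scale_y_discrete(labels = strip_group), so the axis shows the original labels while the ordering uses the unique keys.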

Specifying factor correctly to prevent ggplot legend from ordering alphabetically (I know this has been asked before)

I know this has been asked many times, which makes me feel a bit better as to why I am confused. I am trying to keep the color order consistent with the variable selected for plotting and consistent in the legend.
I looked here and here for what I hoped would solve my problem, but it was either a different issue, or I must have either misunderstood or am making a mistake that I can't see.
I specified the order of the factor variable, Group, but the colors change with the alphabetical order as opposed to the factor order. Since I specify the Group as a factor and specify the level order ("B" is always first as "group_1") I would have thought the color would have been consistent for "B" but it changes.
If someone could let me know what I am missing I would greatly appreciate it!
Reproducible example
library(tidyverse)
# Sample data
Group <- c("A", "B", "C")
Value <- c(3, 3, 5)
# Create data frame
mydata <- data.frame(Group, Value)
# Create variable for group selected for plotting
group_1 <- "B"
group_2 <- "A"
# Make a pyramid plot, making one group negative numbers for bar chart
pyramid <- mydata %>%
filter(Group == group_1 | Group == group_2) %>%
mutate(Value = ifelse(Group == group_2, Value * -1, Value)) %>%
# Thought this would keep color ordering consistent
mutate(Group = factor(Group, levels = c(group_1, group_2)))
pyramid_plot <- ggplot(pyramid, aes(x = Group,
y = Value,
fill = Group)
) +
geom_bar(data = subset(pyramid, Group == group_1),
stat = "identity"
) +
geom_bar(data = subset(pyramid, Group == group_2),
stat = "identity"
) +
coord_flip() +
scale_fill_manual(name = "Group",
values = c("#1f78b4", "#33a02c"), # blue, green
breaks = c(group_1, group_2),
labels = c(group_1, group_2)
)
pyramid_plot
# Now do another plot, keeping group_1 the same but changing group_2
group_1 <- "B"
group_2 <- "C"
pyramid <- mydata %>%
filter(Group == group_1 | Group == group_2) %>%
mutate(Value = ifelse(Group == group_2, Value * -1, Value)) %>%
# Thought this would keep color ordering consistent
mutate(Group = factor(Group, levels = c(group_1, group_2)))
pyramid_plot2 <- ggplot(pyramid, aes(x = Group,
y = Value,
fill = Group)
) +
geom_bar(data = subset(pyramid, Group == group_1),
stat = "identity"
) +
geom_bar(data = subset(pyramid, Group == group_2),
stat = "identity"
) +
coord_flip() +
scale_fill_manual(name = "Group",
values = c("#1f78b4", "#33a02c"), # blue, green
breaks = c(group_1, group_2),
labels = c(group_1, group_2)
)
pyramid_plot2
Group <- c("A", "B", "C")
Value <- c(3, 3, 5)
mydata <- data.frame(Group, Value)
group_1 <- "B"
group_2 <- "A"
pyramid <- mydata %>%
filter(Group == group_1 | Group == group_2) %>%
mutate(Value = ifelse(Group == group_2, Value * -1, Value)) %>%
mutate(Group = factor(Group, levels = c(group_1, group_2)))
ggplot(pyramid, aes(x = Group, y = Value, fill = Group)) +
geom_bar(stat = "identity") +
coord_flip() +
scale_fill_manual(name = "Group",
values = c("#1f78b4", "#33a02c"))
Output:
group_1 <- "B"
group_2 <- "C"
pyramid <- mydata %>%
filter(Group == group_1 | Group == group_2) %>%
mutate(Value = ifelse(Group == group_2, Value * -1, Value)) %>%
mutate(Group = factor(Group, levels = c(group_1, group_2)))
ggplot(pyramid, aes(x = Group, y = Value, fill = Group)) +
geom_bar(stat = "identity") +
coord_flip() +
scale_fill_manual(name = "Group",
values = c("#1f78b4", "#33a02c"))
Second output:
B is always blue.
I think the issue was with calling the geom_bar twice and subsetting the data. Fill does that for you already, so no need to further slice the data.
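Another way to keep a colour tied to a specific group, whichever other group is plotted alongside it, is to pass a named vector to scale_fill_manual(values = ...), since named values are matched to the factor levels by name rather than by position. A small sketch, where make_palette is a hypothetical helper and not part of ggplot2:

```r
# Hypothetical helper: the first group is always blue, the second always green
make_palette <- function(group_1, group_2) {
  setNames(c("#1f78b4", "#33a02c"), c(group_1, group_2))
}

pal_ba <- make_palette("B", "A")  # B vs A
pal_bc <- make_palette("B", "C")  # B vs C

pal_ba
pal_bc
```

Either palette could then be supplied as scale_fill_manual(name = "Group", values = pal_ba), which keeps "B" on the same colour in both plots.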

ordering and plotting by one variable conditional on a second

Task: I would like to reorder a factor variable by the difference in a value between the rows where a second variable equals 1 and the rows where it equals 0. Here is a reproducible example to clarify:
# Package
library(tidyverse)
# Create fake data
df1 <- data.frame(place = c("A", "B", "C"),
avg = c(3.4, 4.5, 1.8))
# Plot, but it's not in order of value
ggplot(df1, aes(x = place, y = avg)) +
geom_point(size = 4)
# Now put it in order
df1$place <- factor(df1$place, levels = df1$place[order(df1$avg)])
# Plots in order now
ggplot(df1, aes(x = place, y = avg)) +
geom_point(size = 4)
# Adding second, conditional variable (called: new)
df2 <- data.frame(place = c("A", "A", "B", "B", "C", "C"),
new = rep(0:1, 3),
avg = c(3.4, 2.3, 4.5, 4.2, 2.1, 1.8))
ggplot(df2, aes(x = place, y = avg, col = factor(new))) +
geom_point(size = 3)
Goal: I would like to order and plot the factor variable place by the difference of avg between place when new is 1 and place when new is 0
You can create the levels for the place column by:
library(tidyr)
df2$place <- factor(df2$place, levels=with(spread(df2, new, avg), place[order(`1` - `0`)]))
ggplot(df2, aes(x = place, y = avg, col = factor(new))) +
geom_point(size = 3) + labs(color = 'new')
gives:
If I understand the goal correctly, then factor A has the biggest difference:
avg(new = 0) - avg(new = 1) = 1.1
So you can spread the data frame to calculate the difference, then gather, then plot avg versus place, reordered by diff. Or if you want A first, by -diff.
But let me know if I didn't understand correctly :)
df2 %>%
spread(new, avg) %>%
mutate(diff = `0` - `1`) %>%
gather(new, avg, -diff, -place) %>%
ggplot(aes(reorder(place, diff), avg)) +
geom_point(aes(color =factor(new)), size = 3)
Calculate the column first using dplyr:
# store the result, otherwise the diff column is lost
df2 <- df2 %>% group_by(place) %>% mutate(diff = diff(avg)) %>% ungroup()
ggplot(df2, aes(x = place, y = diff, color = diff)) +
geom_point(size = 3)
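The reordering step itself can also be checked without any plotting. Below is a base R sketch under the assumption that every place has exactly one row for new = 0 and one for new = 1; avg0, avg1 and place_diff are names invented for this example.

```r
df2 <- data.frame(place = c("A", "A", "B", "B", "C", "C"),
                  new = rep(0:1, 3),
                  avg = c(3.4, 2.3, 4.5, 4.2, 2.1, 1.8))

# avg per place for new == 0 and for new == 1, as named vectors
avg0 <- with(df2[df2$new == 0, ], setNames(avg, place))
avg1 <- with(df2[df2$new == 1, ], setNames(avg, place))

# Difference (new == 1 minus new == 0), aligned by place name
place_diff <- avg1 - avg0[names(avg1)]

# Use the places, sorted by their difference, as factor levels
df2$place <- factor(df2$place, levels = names(sort(place_diff)))

place_diff
levels(df2$place)
```

Any of the ggplot calls above can then be used unchanged, because the ordering is carried by the factor levels of place.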

ggplot2 in R: geom_segment displays different line than geom_line

Say I have this data frame:
library(data.table)
library(ggplot2)
treatment <- c(rep("A",6),rep("B",6),rep("C",6),rep("D",6),rep("E",6),rep("F",6))
year <- as.numeric(c(1999:2004,1999:2004,2005:2010,2005:2010,2005:2010,2005:2010))
variable <- c(runif(6,4,5),runif(6,5,6),runif(6,3,4),runif(6,4,5),runif(6,5,6),runif(6,6,7))
se <- c(runif(6,0.2,0.5),runif(6,0.2,0.5),runif(6,0.2,0.5),runif(6,0.2,0.5),runif(6,0.2,0.5),runif(6,0.2,0.5))
id <- 1:36
df1 <- as.data.table(cbind(id,treatment,year,variable,se))
df1$year <- as.numeric(df1$year)
df1$variable <- as.numeric(df1$variable)
df1$se <- as.numeric(df1$se)
As I mentioned in a previous question (draw two lines with the same origin using ggplot2 in R), I wanted to use ggplot2 to display my data in a specific way.
I managed to do so using the following script:
y1 <- df1[df1$treatment=='A'&df1$year==2004,]$variable
y2 <- df1[df1$treatment=='B'&df1$year==2004,]$variable
y3 <- df1[df1$treatment=='C'&df1$year==2005,]$variable
y4 <- df1[df1$treatment=='D'&df1$year==2005,]$variable
y5 <- df1[df1$treatment=='E'&df1$year==2005,]$variable
y6 <- df1[df1$treatment=='F'&df1$year==2005,]$variable
p <- ggplot(df1,aes(x=year,y=variable,group=treatment,color=treatment))+
geom_line(aes(y = variable, group = treatment, linetype = treatment, color = treatment),size=1.5,lineend = "round") +
scale_linetype_manual(values=c('solid','solid','solid','dashed','solid','dashed')) +
geom_point(aes(colour=factor(treatment)),size=4)+
geom_errorbar(aes(ymin=variable-se,ymax=variable+se),width=0.2,size=1.5)+
guides(colour = guide_legend(override.aes = list(shape=NA,linetype = c("solid", "solid",'solid','dashed','solid','dashed'))))
p+labs(title="Title", x="years", y = "Variable 1")+
theme_classic() +
scale_x_continuous(breaks=c(1998:2010), labels=c(1998:2010),limits=c(1998.5,2010.5))+
geom_segment(aes(x=2004, y=y1, xend=2005, yend=y3),colour='blue1',size=1.5,linetype='solid')+
geom_segment(aes(x=2004, y=y1, xend=2005, yend=y4),colour='blue1',size=1.5,linetype='dashed')+
geom_segment(aes(x=2004, y=y2, xend=2005, yend=y5),colour='red3',size=1.5,linetype='solid')+
geom_segment(aes(x=2004, y=y2, xend=2005, yend=y6),colour='red3',size=1.5,linetype='dashed')+
scale_color_manual(values=c('blue1','red3','blue1','blue1','red3','red3'))+
theme(text = element_text(size=12))
As you can see I used both geom_line and geom_segment to display the lines for my graph.
It's almost perfect, but if you look closely, the segments drawn between 2004 and 2005 do not have the same line size as the lines, even though I used the same argument values in the script (i.e. size=1.5 and linetype='solid' or 'dashed').
Of course I could manually change the size of the segments to get similar lines, but when I do that, the segments are not as smooth as the lines drawn with geom_line.
I also get the same problem (different line shapes) when I include the size or linetype arguments within aes().
Do you have any idea what causes this difference and how I can get exactly the same shapes for both my segments and lines?
It seems to be an anti-aliasing issue with geom_segment, but drawing the connections as separate segments seems like a somewhat cumbersome approach to begin with. I think I have resolved your issue by duplicating the A and B treatments in the original data frame.
# First we are going to duplicate and rename the 'shared' treatments
library(dplyr)
library(ggplot2)
df1 %>%
filter(treatment %in% c("A", "B")) %>%
mutate(treatment = ifelse(treatment == "A",
"AA", "BB")) %>%
bind_rows(df1) %>% # This rejoins with the original data
# Now we create `treatment_group` and `line_type` variables
mutate(treatment_group = ifelse(treatment %in% c("A", "C", "D", "AA"),
"treatment1",
"treatment2"), # This variable will denote color
line_type = ifelse(treatment %in% c("AA", "BB", "D", "F"),
"type1",
"type2")) %>% # And this variable denotes the line type
# Now pipe into ggplot
ggplot(aes(x = year, y = variable,
group = interaction(treatment_group, line_type), # grouping by both linetype and color
color = treatment_group)) +
geom_line(aes(x = year, y = variable, linetype = line_type),
size = 1.5, lineend = "round") +
geom_point(size=4) +
# The rest here is more or less the same as what you had
geom_errorbar(aes(ymin = variable-se, ymax = variable+se),
width = 0.2, size = 1.5) +
scale_color_manual(values=c('blue1','red3')) +
scale_linetype_manual(values = c('dashed', 'solid')) +
labs(title = "Title", x = "Years", y = "Variable 1") +
scale_x_continuous(breaks = c(1998:2010),
limits = c(1998.5, 2010.5))+
theme_classic() +
theme(text = element_text(size=12))
Which will give you the following
My numbers are different since they were randomly generated.
You can then modify the legend to your liking, but my recommendation is to label the lines directly with something like geom_text, which supports check_overlap = TRUE.
Hope this helps!
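The effect of the duplication step can be verified on the data alone: after copying A and B under new names, each combination of colour group and line type should cover the whole 1999 to 2010 range, so geom_line can draw one continuous path per combination. A base R sketch with freshly simulated values; long and path are names assumed for this illustration.

```r
set.seed(1)
df1 <- data.frame(
  treatment = rep(c("A", "B", "C", "D", "E", "F"), each = 6),
  year = c(1999:2004, 1999:2004, rep(2005:2010, 4)),
  variable = runif(36, 3, 7)
)

# Duplicate the shared treatments A and B under new names
shared <- df1[df1$treatment %in% c("A", "B"), ]
shared$treatment <- ifelse(shared$treatment == "A", "AA", "BB")
long <- rbind(shared, df1)

# Same colour and line type assignment as in the answer
long$treatment_group <- ifelse(long$treatment %in% c("A", "C", "D", "AA"),
                               "treatment1", "treatment2")
long$line_type <- ifelse(long$treatment %in% c("AA", "BB", "D", "F"),
                         "type1", "type2")

# One path per colour/line type combination
long$path <- interaction(long$treatment_group, long$line_type)

# Each path should have 12 points, one per year from 1999 to 2010
table(long$path)
```

Because every path now contains its own copy of the 1999 to 2004 values, no geom_segment is needed to bridge 2004 and 2005.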

How to plot several boxplots by group in r?

ID <- 1:10
group <- c(1,1,1,2,2,2,3,3,3,3)
var1 <- c(6:15)
var2 <- c(7:16)
var3 <- c(6:11, NA, NA, NA, NA)
var4 <- c(4:9, NA, NA, NA, NA)
data <- data.frame(ID, group, var1, var2, var3, var4)
library(dplyr)
data %>% group_by(group) %>% boxplot(var1, var2)
The last line does not work as I wish. The idea is to get 4 boxplots in one graphic, two for each variable. Maybe I need to use ggplot2?
You need to reorganize the data if you want to get both variables in the same plot. Here is a ggplot2 solution:
# load library
library(ggplot2)
library(tidyr)
library(ggthemes)
# reorganize data into long format, keeping ID and group as identifiers
df <- gather(data, "var", "value", -ID, -group)
# make sure the columns are named as expected
colnames(df) <- c("ID","group","var","value")
# plot
ggplot(data=df) +
geom_boxplot( aes(x=factor(group), y=value, fill=factor(var)), position=position_dodge(1)) +
scale_x_discrete(breaks=c(1, 2, 3), labels=c("A", "B", "C")) +
theme_minimal() +
scale_fill_grey()
Making boxplots with the same width is a whole different question (solution here), but one simple alternative would be like this:
# recode column `group` in the `data.frame`.
df <- transform(df, group = ifelse(group==1, 'A', ifelse(group==2, 'B', "C")))
# plot
ggplot(data=df) +
geom_boxplot( aes(x=factor(var), y=value, fill=factor((var))), position=position_dodge(1)) +
geom_jitter(aes(x=factor(var), y=value, color=factor((var)))) +
facet_grid(.~group, scales = "free_x") +
theme_minimal()+
scale_fill_grey() +
theme(axis.text.x=element_blank(),
axis.title.x=element_blank(),
axis.ticks=element_blank())
You might try melting the data frame (mentioned in a comment by @lukeA) first and then sticking to base graphics. ggplot2 or lattice are other good options.
library(reshape2)
DF <- melt(data, id.vars = c("ID", "group"), measure.vars = c("var1", "var2"))
boxplot(value ~ group + variable, DF)
Alternate lattice code, also using DF:
library(lattice)
bwplot(~ value | variable + group, data = DF)
Alternate ggplot2 code, also using DF:
ggplot(DF, aes(x = factor(group), y = value, fill = variable)) + geom_boxplot()
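If reshape2 is not available, the same long layout can be built by hand. The sketch below assumes only var1 and var2 are of interest; it uses boxplot(..., plot = FALSE) to show which boxes would be drawn, one per group and variable combination, without opening a graphics device. The name long is chosen for this example.

```r
ID <- 1:10
group <- c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3)
var1 <- 6:15
var2 <- 7:16
data <- data.frame(ID, group, var1, var2)

# Stack var1 and var2 into a single value column (long format)
long <- data.frame(
  ID = rep(data$ID, times = 2),
  group = rep(data$group, times = 2),
  variable = rep(c("var1", "var2"), each = nrow(data)),
  value = c(data$var1, data$var2)
)

# plot = FALSE returns the box statistics instead of drawing them
bp <- boxplot(value ~ group + variable, data = long, plot = FALSE)
bp$names
```

Dropping plot = FALSE draws the boxes in a single panel.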
Although quite late, I found a great base-R solution here
# Create some data, e.g. from https://en.wikipedia.org/wiki/One-way_analysis_of_variance#Example
df <- as.data.frame(matrix(c(6, 8, 13, 8, 12, 9, 4, 9, 11, 5, 11, 8, 3, 6, 7, 4, 8, 12),ncol = 3, byrow = TRUE))
df <- reshape(data = df, direction = "long", idvar=1:3, varying=1:3, sep = "", timevar = "Treatment")
df$Treatment <- as.factor(df$Treatment)
rownames(df) <- NULL
par(mfrow = c(2, 1))
par(mar=c(1,4,4,2) + 0.1) # mar=c(b,l,t,r)
boxplot(V ~ Treatment, data = df, xlab = NULL, xaxt = "n",
ylab = "V", main = "One-way anova with 3 different levels of one factor")
stripchart(V ~ Treatment, # Points
data = df, # Data
method = "jitter", # Random noise
pch = 19, # Pch symbols
col = 4, # Color of the symbol
vertical = TRUE, # Vertical mode
add = TRUE) # Add it over
par(mar=c(5,4,0,2) + 0.1)
boxplot(V ~ Treatment, data = df, xlab = "Treatment",
ylab = "V", main = NULL)
stripchart(V ~ Treatment, # Points
data = df, # Data
method = "overplot", # Overplot coincident points (no jitter)
pch = 19, # Pch symbols
col = 4, # Color of the symbol
vertical = TRUE, # Vertical mode
add = TRUE) # Add it over
par(mfrow = c(1, 1))
Result:
