Adding geom_line mean to reordered geom_point plot in ggplot - r

Trying to produce a point plot that reorders my values and also has a mean line above the values.
I can produce the plot with the mean line, or the reordered values but not both at the same time because I get the error
"geom_path: Each group consists of only one observation. Do you need to adjust the group aesthetic?".
I believe I am getting the error as some of my data only has one observation but I don't understand why this only becomes an issue with the reorder data.
In the end all I want is to be able to show the means of the two different values groups for each x value.
Here is my sample code
library(ggplot2)
typ <- c("T", "N", "T", "T", "N")
samplenum <- c(7,7,6,8,8)
values <- c(1,2,1,3,2)
df = data.frame(typ, samplenum, values)
d <- ggplot(df, aes(x= reorder(samplenum, values), y= values))
d <- d + geom_point(position=position_jitter(width=0.15, height=0.05))
d <- d + aes(colour = factor(df$typ))
d <- d + stat_summary(fun.y = mean, geom="line")
d
Thank you for the help in advance.
This is what I am going for
Here is some steps before the completion sample pictures of what I have produced from my larger data set.
With Line but Not Reordered
Reordered but No Mean Line

As the error message suggests, you need to adjust the group aesthetic. When you use reorder you will end up with a discrete scale but you want to draw lines that connect across groups, that's why the error.
You can try this
ggplot(df, aes(x = reorder(samplenum, values), y = values, colour = factor(typ))) +
geom_jitter(width = 0.15, height = 0.05) +
stat_summary(fun.y = mean, geom = "line", aes(group = factor(typ)))
(I altered your data slighly so it contains more observations.)
data
df <- structure(list(typ = structure(c(2L, 1L, 2L, 2L, 1L, 2L, 1L,
2L, 2L, 1L, 2L, 1L, 2L, 2L, 1L), .Label = c("N", "T"), class = "factor"),
samplenum = c(7, 7, 6, 8, 8, 7, 7, 6, 8, 8, 7, 7, 6, 8, 8
), values = c(1L, 3L, 2L, 1L, 3L, 3L, 1L, 3L, 2L, 2L, 2L,
1L, 3L, 1L, 2L)), .Names = c("typ", "samplenum", "values"
), row.names = c(NA, -15L), class = "data.frame")
The resulting plot with your input data

Related

gganimate transition_reveal() with geom_line() breaking on the final frame?

I am trying to animate a line graph with multiple lines. It seems that there is an error with the gganimate package involving transition_reveal() that is causing the final frame to revert for all of the lines but one. This error is not present when not using gganimate. Here is the code:
df <- read.csv("test.csv", stringsAsFactors = TRUE)
anim <- ggplot(df, aes(Day, Accidents, group = State, color = State)) +
geom_line() +
transition_reveal(Day) +
ease_aes('cubic-in-out')
jiff <- animate(anim, fps = 24, duration = 5, start_pause = 0, end_pause = 72, height = 4, width = 7, units = "in", res = 150)
jiff
Here is the dput of the dataframe:
structure(list(State = structure(c(1L, 2L, 3L, 4L, 1L, 2L, 3L,
4L, 1L, 2L, 3L, 4L), levels = c("A", "B", "C", "D"), class = "factor"),
Day = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L),
Accidents = c(5L, 2L, 5L, 6L, 1L, 2L, 6L, 8L, 4L, 10L, 2L,
4L)), class = "data.frame", row.names = c(NA, -12L))
Here is the output:
Regardless of the ending pause or how many values I have along the x-axis, the final frame will always look like this with only one line appearing as updated. Does anyone know why this might be happening?
UPDATE: Reverting the gganimate package from 1.0.8 to 1.0.7 did seem to do the trick after all.
The issue is in this line start_pause = 0, end_pause = 72,. Remove or adapt it:
anim <- ggplot(df, aes(Day, Accidents, group= State, color = State)) +
geom_line() +
transition_reveal(Day) +
ease_aes('cubic-in-out')
animate(anim, fps = 24, duration = 5,
height = 4, width = 7, units = "in", res = 150)

Ggplot stacked bar plot with percentage labels

I am trying to do a stacked bar plot based on count, but with the labels showing the percentage on the plot. I have produced the plot below. However the percentage is based on all of the data. What I am after is the percentage by team (such that the sum of the percentages for Australia = 100% and the percentages for England = 100%).
The code for achieving this is the following function. This function counts the number of different roles in each team across 5 matches (I have had to divide the result by 10 as a players role appears twice for each match (5 matches x 2 appearances):
team_roles_Q51 <- function(){
ashes_df <- tidy_data()
graph <- ggplot(ashes_df %>%
count(team, role) %>% #Groups by team and role
mutate(pct=n/sum(n)), #Calculates % for each role
aes(team, n, fill=role)) +
geom_bar(stat="identity") +
scale_y_continuous(labels=function(x)x/10) + #Needs to be a better way than dividing by 10
ylab("Number of Participants") +
geom_text(aes(label=paste0(sprintf("%1.1f", pct*100),"%")),
position=position_stack(vjust=0.5)) +
ggtitle("England & Australia Team Make Up") +
theme_bw()
print(graph)
}
An example of the dataframe that is imported is:
Structure for the first 10 rows of the dataframe as follows:
structure(list(batter = c("Ali", "Anderson", "Bairstow", "Ball",
"Bancroft", "Bird", "Broad", "Cook", "Crane", "Cummins"), team = structure(c(2L,
2L, 2L, 2L, 1L, 1L, 2L, 2L, 2L, 1L), .Label = c("Australia",
"England"), class = "factor"), role = structure(c(1L, 3L, 4L,
3L, 2L, 3L, 3L, 2L, 3L, 3L), .Label = c("allrounder", "batsman",
"bowler", "wicketkeeper"), class = "factor"), innings = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("test_1_innings_1",
"test_1_innings_2", "test_2_innings_1", "test_2_innings_2", "test_3_innings_1",
"test_3_innings_2", "test_4_innings_1", "test_4_innings_2", "test_5_innings_1",
"test_5_innings_2"), class = "factor"), batting_num = c(6, 11,
7, 10, 1, NA, 9, 1, NA, 9), score = c(38, 5, 9, 14, 5, NA, 20,
2, NA, 42), balls_faced = c(102, 9, 24, 11, 19, NA, 32, 10, NA,
120)), row.names = c(NA, 10L), class = "data.frame")
Any help would be appreciated. Thanks
You need to group_by team to calculate the proportion and use pct in aes :
library(dplyr)
library(ggplot2)
ashes_df %>%
count(team, role) %>%
group_by(team) %>%
mutate(pct= prop.table(n) * 100) %>%
ggplot() + aes(team, pct, fill=role) +
geom_bar(stat="identity") +
ylab("Number of Participants") +
geom_text(aes(label=paste0(sprintf("%1.1f", pct),"%")),
position=position_stack(vjust=0.5)) +
ggtitle("England & Australia Team Make Up") +
theme_bw()

Grouped box plot

Here is what it looks like after those edits - lines but no boxes.
Reproducible code:
df <- data.frame(SampleID = structure(c(5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L),
.Label = c("C004", "C005", "C007", "C009", "C010",
"C011", "C013", "C027", "C028", "C029",
"C030", "C031", "C032", "C033", "C034",
"C035", "C036", "C042", "C043", "C044",
"C045", "C046", "C047", "C048", "C049",
"C058", "C086"), class = "factor"),
Sequencing.Depth = c(1L, 2612L, 5223L, 7834L, 10445L, 13056L, 15667L, 18278L,
20889L, 23500L),
Observed.OTUs = c(1, 213, 289.5, 338, 377.8, 408.9, 434.4, 453.8, 472.1, NA),
Mange = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L),
.Label = c("N", "Y"), class = "factor"),
SpeciesCode = structure(c(3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L),
.Label = c("Cla", "Ucin", "Vvu"), class = "factor"))
In your aes, you can use interaction of your x values and your categorical values for plotting boxplot on a continuous x axis and pass position = "identity" in order to place them on the precise x values and not to be dodged.
Here to add the line connecting each boxplot, I calculate mean per Species per x values using dplyr directly inggplot but you can calculate outside and generate a second dataframe.
So, as your x values are pretty spread from 1 to 23500, you will have to modify the width of the geom_boxplot in order to see a box and not a single line:
library(ggplot2)
library(dplyr)
ggplot(df,aes(x = Xvalues, y = Yvalues, color = Species,
group = interaction(Species, Xvalues)))+
geom_boxplot(position = "identity", width = 1000)+
geom_line(data = df %>%
group_by(Xvalues, Species) %>%
summarise(Mean = mean(Yvalues)),
aes(x = Xvalues, y = Mean,
color = Species, group = Species))
So, apply to your dataset (based on informations you provided in your code), you should try something like:
library(ggplot2)
library(dplyr)
ggplot(observedotusrare,
aes(x=Sequencing.Depth, y=Observed.OTUs,
color=SpeciesCode,
group = interaction(Sequencing.Depth, SpeciesCode))) +
geom_boxplot(position = "identity", width = 1000) +
geom_line(data = observedotusrare %>%
group_by(Sequencing.Depth, SpeciesCode) %>%
summarise(Mean = mean(Observed.OTUs, na.rm = TRUE)),
aes(x = Sequencing.Depth, y = Mean,
color = SpeciesCode, group = SpeciesCode))
Does it answer your question ?
Reproducible example
df <- data.frame(Xvalues = rep(c(10,2000,23500), each = 30),
Species = rep(rep(LETTERS[1:3], each = 10),3),
Yvalues = c(rnorm(10,1,1),
rnorm(10,5,1),
rnorm(10,8,1),
rnorm(10,5,1),
rnorm(10,8,1),
rnorm(10,12,1),
rnorm(10,20,1),
rnorm(10,30,1),
rnorm(10,50,1)))

Assigning tick marks frequency to discrete data axes in facet_grid

I'm having some trouble setting readable tick marks on my axes. The problem is that my data are at different magnitudes, so I'm not really sure how to go about it.
My data include ~400 different products, with 3/4 variables each, from two machines. I've pre-processed it into a data.table and used gather to convert it to long form- that part is fine.
Overview: Data is discrete, each X_________ on the x-axis represents a separate reading, and its relative values from machine 1/2 - the idea is to compare the two. The graphical format is perfect for my needs, I would just like to set the ticks at say, every 10 products on the x-axes, and at reasonable values on the y-axis.
Y_1: from 150 to 250
Y_2: from say, 1.5* to 2.5
Y_3: from say, 0.8* to 2.3
Y_4: from say, 0.4* to 1.5
*Bottom value, rounded down
Here's the code I'm using so far
var.Parameter <- c("Var1", "Var2", "Var3", "Var4")
MProduct$Parameter <- factor(MProduct$Parameter,
labels = var.Parameter)
labels_x <- MProduct$Lot[seq(0, 1626, by= 20)]
labels_y <- MProduct$Value[seq(0, 1626, by= 15)]
plot.MProduct <- ggplot(MProduct, aes(x = Lot,
y = Value,
colour = V4)) +
facet_grid(Parameter ~.,
scales = "free_y") +
scale_x_discrete(breaks=labels_x) +
scale_y_discrete(breaks=labels_y) +
geom_point() +
labs(title = "Product: Select Trends | 2018",
x = "Time (s)",
y = "Value") +
theme(axis.text.x = element_text (angle = 90,
hjust = 1,
vjust = 0.5))
# ggsave("MProduct.png")
plot.MProduct
Anyone knows how to possibly render this graph more readable? Setting labels/breaks manually greatly limits flexibility and readability - there should be an option to set it to every X ticks, right? Same with y.
I need to apply this as a function to multiple datasets, so I'm not very happy about having to specify the column length of the "gathered" dataset every time either, which, in this case is 1626.
Since I'm here, I would also like to take the opportunity to ask about this code:
var.Parameter <- c("Var1", "Var2", "Var3", "Var4")
More often than not, I need to label my data in a specific order, which is not necessarily alphabetical. R, however, defaults to some kind of odd behaviour whereupon I have to plot and verify that the labels are indeed where they should be. Any clue how I could force them to be presented in order? As it is, my solution is to keep shifting their position in that line of code until it produces the graph correctly.
Many thanks.
Okay. I'm going to ignore the y axis labels because the defaults seem to work just fine as long as you don't try to overwrite them with your custom labels_y thing. Just let the defaults do their work. For the X axis, we'll give a couple options:
(A) label every N products on X-axis. Looking at ?scale_x_discrete, we can set the labels to a function that takes all the level of the factor and returns the labels we want. So we'll write a functional that returns a function that returns every Nth label:
every_n_labeler = function(n = 3) {
function (x) {
ind = ((1:length(x)) - 1) %% n == 0
x[!ind] = ""
return(x)
}
}
Now let's use that as the labeler:
ggplot(df, aes(x = Lot,
y = Value,
colour = Machine)) +
facet_grid(Parameter ~ .,
scales = "free_y") +
geom_point() +
scale_x_discrete(labels = every_n_labeler(3)) +
labs(title = "Product: Select Trends | 2018",
x = "Time (s)",
y = "Value") +
theme(axis.text.x = element_text (
angle = 90,
hjust = 1,
vjust = 0.5
))
You can change the every_n_labeler(3) to (10) to make it every 10th label.
(B) Maybe more appropriate, it seems like your x-axis is actually numeric, it just happens to have "X" in front of it, let's convert it to numeric and let the defaults do the labeling work:
df$time = as.numeric(gsub(pattern = "X", replacement = "", x = df$Lot))
ggplot(df, aes(x = time,
y = Value,
colour = Machine)) +
facet_grid(Parameter ~ .,
scales = "free_y") +
geom_point() +
labs(title = "Product: Select Trends | 2018",
x = "Time (s)",
y = "Value") +
theme(axis.text.x = element_text (
angle = 90,
hjust = 1,
vjust = 0.5
))
With your full x range, I imagine that would look nice.
(C) But who wants to read those 9-digit numbers? You're labeling the x-axis a "Time (s)", which makes me think it's actual a time, measured in seconds from some start time. I'll make up that your start time is 2010-01-01 and covert these seconds to actual times, and then we get a nice date-time scale:
ggplot(df_s, aes(x = as.POSIXct(time, origin = "2010-01-01"),
y = Value,
colour = Machine)) +
facet_grid(Parameter ~ .,
scales = "free_y") +
geom_point() +
labs(title = "Product: Select Trends | 2018",
x = "Time (s)",
y = "Value") +
theme(axis.text.x = element_text (
angle = 90,
hjust = 1,
vjust = 0.5
))
If this is the real meaning behind your data, then using a date-time axis is a big step up for readability. (Again, notice that we are not specifying the breaks, the defaults work quite well.)
Using this data (I subset your sample data down to 2 facets and used dput to make it copy/pasteable):
df = structure(list(Lot = structure(c(1L, 2L, 3L, 4L, 1L, 2L, 3L,
4L, 1L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 1L, 2L, 3L, 4L, 1L,
2L, 3L, 4L, 1L), .Label = c("X180106482", "X180126485", "X180306523",
"X180526326"), class = "factor"), Value = c(201, 156, 253, 211,
178, 202.5, 203.4, 204.3, 205.2, 2.02, 2.17, 1.23, 1.28, 1.54,
1.28, 1.45, 1.61, 2.35, 1.34, 1.36, 1.67, 2.01, 2.06, 2.07, 2.19,
1.44, 2.19), Parameter = structure(c(1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L), .Label = c("Var 1", "Var 2", "Var 3", "Var 4"
), class = "factor"), Machine = structure(c(2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L), .Label = c("Machine 1", "Machine 2"), class = "factor"),
time = c(180106482, 180126485, 180306523, 180526326, 180106482,
180126485, 180306523, 180526326, 180106482, 180106482, 180126485,
180306523, 180526326, 180106482, 180126485, 180306523, 180526326,
180106482, 180106482, 180126485, 180306523, 180526326, 180106482,
180126485, 180306523, 180526326, 180106482)), row.names = c(NA,
-27L), class = "data.frame")

ggplot2 how to plot rows to multiple x-axis datapoints

I am trying to create this type of chart from the data on the left (arbitrary values for simplicity):
The goal is to plot variable X on the x-axis with the mean on the Y-axis and error bars equal to the standard error se.
The problem is that values 1-10 should be each be represented individually (blue curve), and that the values for A and B should be plotted on each of the 1-10 values (green and red line).
I can draw the curve if I manually save the data and manually copy the values for A and B to each value for X but this is not very time efficient. Is there a more elegant way to do this?
Thanks in advance!
EDIT: As suggested the code:
df <- structure(list(X = structure(c(1L, 3L, 4L, 5L, 6L, 7L, 8L, 9L,
10L, 2L, 11L, 12L), .Label = c("1", "10", "2", "3", "4", "5",
"6", "7", "8", "9", "A", "B"), class = "factor"), mean = c(1,
2, 3, 4, 5, 6, 7, 8, 9, 10, 5.5, 6.5), sd = c(1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), se = c(1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L)), .Names = c("X", "mean", "sd", "se"), class = "data.frame", row.names = c(NA,-12L))
df<-as.data.frame(df)
df$X<-factor(df$X)
plot <- ggplot(df, aes(x=df$X, y=df$mean)) + geom_point() + geom_errorbar(aes(ymin=mean-se, ymax=mean+se), width=.1)
plot
Im afraid I don't know ggplot, but hopefully this is what you want (it might also aid others in understanding your question).
You want a ggplot with three lines,
1. df$X,df$mean
2. df$X,df$row_A_mean
3. df$X,df$row_B_mean
4. error bars of the SE column
df <- structure(list(X = structure(c(1L, 3L, 4L, 5L, 6L, 7L, 8L, 9L,
10L, 2L, 11L, 12L), .Label = c("1", "10", "2", "3", "4", "5",
"6", "7", "8", "9", "A", "B"), class = "factor"), mean = c(1,
2, 3, 4, 5, 6, 7, 8, 9, 10, 5.5, 6.5), sd = c(1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), se = c(1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L)), .Names = c("X", "mean", "sd", "se"), class = "data.frame", row.names = c(NA,-12L))
df<-as.data.frame(df)
df$X<-factor(df$X)
plot <- ggplot(df, aes(x=df$X, y=df$mean)) + geom_point() + geom_errorbar(aes(ymin=mean-se, ymax=mean+se), width=.1)
plot
#row A mean
df$row_A_mean<-rep(df[11,]$mean,nrow(df))# note that this could also be replaces by a horizontal line, unless the mean changes
#row A sd
df$row_A_sd<-rep(df[11,]$sd,nrow(df))
plot(as.numeric(df$X),df$mean,type="p",col="red")
lines(as.numeric(df$X),df$row_A_mean,col="green")
If we use a subset to define the data elements of the ggplot, we can come up with one solution using geom_hline:
theme_set(theme_bw())
ggplot(data = df[1:10,])+
geom_errorbar(aes(x = X, ymin = mean - se, ymax = mean + se))+
geom_point(aes(x = X, y = mean))+
geom_line(aes(x = X, y = mean), group = 1)+
geom_hline(data = df[11,], aes(yintercept = mean, colour = 'A'))+
geom_hline(data = df[12,], aes(yintercept = mean, colour = 'B'))
It's helpful to reorient your data into long form so that you can really utilize the aesthetic part of ggplot. Generally I would use reshape2::melt for this, but your data the way it's currently formatted doesn't really lend itself to it. I'll show you what I mean by long form and you can get the idea what we're shooting for:
#setting variables for your classes so it's a bit more scalable - reset as applicable
x.seriesLength <- 10
x.class.name <- "X" #name of the main series class; X in your example
a.vec <- c(5.5, 1, 1, "A")
b.vec <- c(6.5, 1, 1, "B")
#trimming df so we can reshape
df <- df[1:x.seriesLength, 2:4]
df$class <- x.class.name #adding class column
#converting your static A and B values to long form, sending to a data.frame and adding to df
add <- matrix(c(rep(a.vec, times = x.seriesLength),
rep(b.vec, times = x.seriesLength)),
byrow = T,
ncol = 4)
colnames(add) <- c("mean", "sd", "se", "class")
df <- rbind(df, add)
print(df)
Then we need to do a bit more cleaning:
df$rownum <- rep(1:x.seriesLength, times = 3)
df[,1:3] <- sapply(df[,1:3], as.numeric) #casting as numeric
df$barmin <- df$mean - df$sd
df$barmax <- df$mean + df$sd
Now we have a long form data frame with the required data. We can then use the new class column to plot and color multiple series.
#use class column to tell ggplot which points belong to which series
g <- ggplot(data = df) +
geom_point(aes(x = rownum, y = mean, color = class)) +
geom_errorbar(aes(x = rownum, ymin=barmin, ymax=barmax, color = class), width=.1)
g
Edit: If you want lines instead of points, just replace geom_point with geom_line.

Resources