Why does ggplot2 draw a vertical line between points? - r

I've searched SO and the internet far and wide, but can't find a reason for, or a solution to, this problem. When plotting time-series-type data with ggplot2, I always get a vertical line connecting my points, instead of the points being plotted individually and simply connected by lines over time. Here's an example using mpg.
require(ggplot2)
gg <- ggplot(mpg, aes(x=year, y=cty,
group=manufacturer, colour=manufacturer))
gg + geom_point() + geom_line()
Is there any way to have the vertical line connecting the points removed? And why does ggplot2 do this? Thanks for your help in advance!
EDITED BASED ON DOWN VOTE AND QUESTIONS BELOW.
Perhaps mpg wasn't the best dataset to use as an example. I have multiple observations for individuals at defined time points, which I want to plot by combining geom_point() and geom_line(). However, at each time point my individual observations (points) are also connected by a vertical line. I do not know what it means or how to remove it. Is it because I have multiple observations for the same individual at the same time point?
Here's a dataset that helps illustrate the problem.
dput(x1)
structure(list(Assessment_Time = structure(c(1L, 2L, 1L, 1L,
2L, 2L, 3L, 3L, 4L, 4L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L,
3L, 4L, 4L, 4L, 1L, 3L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 2L, 4L, 4L,
6L, 6L, 2L, 3L, 3L, 2L, 2L, 3L, 3L, 4L, 4L, 1L, 1L, 2L), .Label = c("Initial",
"First follow-up", "Second follow-up", "Third follow-up", "Fourth follow-up",
"Fifth follow-up"), class = "factor"), id = c(454316L, 454316L,
1184099L, 1184099L, 1184099L, 1184099L, 1184099L, 1184099L, 1184099L,
1184099L, 124227L, 124227L, 124227L, 124227L, 124227L, 124227L,
124227L, 124227L, 124227L, 124227L, 124227L, 124227L, 124227L,
124227L, 1227808L, 1227808L, 1234280L, 1234280L, 1234280L, 1234280L,
1233898L, 1233898L, 1233898L, 1233898L, 1233898L, 1233898L, 1233898L,
1233898L, 1191086L, 1191086L, 1191086L, 1232973L, 1232973L, 1232973L,
1232973L, 1232973L, 1232973L, 1251251L, 1251251L, 1251251L),
US_thickest_um = c(3400, 1500, 7600, 6000, 6600, 4500, 6100,
4000, 6400, 3500, 2300, 2400, 3400, 2200, 1500, 2500, 2100,
1500, 2500, 1700, 1700, 3800, 2800, 2800, 2300, 1300, 6000,
3200, 3800, 1900, 5400, 6200, 2200, 3000, 1900, 2100, 1900,
2500, 4600, 2800, 2100, 3400, 1900, 2400, 1700, 2100, 1300,
2800, 4000, 3700)), .Names = c("Assessment_Time", "id", "US_thickest_um"
), row.names = c(NA, -50L), class = c("tbl_df", "tbl", "data.frame"
))
gg <- ggplot(x1, aes(x=Assessment_Time, y=US_thickest_um, group=factor(id)))
gg + geom_point(aes(colour=factor(id))) + geom_line(aes(colour=factor(id)))

It's not totally clear what your goal is here, but let's say it is to compare the mean for each manufacturer in 1999 and 2008 in a way that also shows the variation by plotting the individual points.
You could do something like this, playing around with the options until you get it the way you want.
means <- mpg %>% dplyr::group_by(year, manufacturer) %>% dplyr::summarize(cty = mean(cty))
ggplot(mpg, aes(x=year, y = cty)) +
geom_jitter(aes(colour = manufacturer), width = 0.15) +
geom_line(data = means, aes(group = manufacturer, colour = manufacturer))
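On the question of why the vertical segments appear: geom_line() connects all observations within a group in order of their x value, so two observations that share an x value are joined by a vertical segment. A minimal sketch of one way around it, assuming a small made-up data frame (toy, toy_means and their columns are hypothetical, not the asker's data): draw the points from the raw data, but draw the line from one summary value per individual and time point.

```r
library(ggplot2)

# Hypothetical toy data: two individuals, two observations per time point
toy <- data.frame(
  id    = factor(rep(c("a", "b"), each = 4)),
  time  = factor(rep(c("t1", "t1", "t2", "t2"), times = 2)),
  value = c(3, 5, 4, 6, 1, 2, 2, 3)
)

# One mean per id and time point, so each line has a single y value per x
toy_means <- aggregate(value ~ id + time, data = toy, FUN = mean)

# Points come from the raw data, lines from the per-time-point means
p <- ggplot(toy, aes(x = time, y = value, colour = id)) +
  geom_point() +
  geom_line(data = toy_means, aes(group = id))
```

Whether the mean is the right summary depends on the data; a median, or one panel per replicate, would follow the same pattern.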

It's not clear what you're trying to do. You refer to time-series data but actually use something completely different: neither mpg nor your updated sample data is time-series data.
I assume you are asking how to plot time-series data in ggplot and encode different time series as differently coloured lines. Here is a simple example that should help you get started.
First off, let's generate data for 10 time series.
ts <- replicate(
10,
ts(cumsum(1 + round(rnorm(100), 2)), start = c(1954, 7), frequency = 12),
simplify = FALSE)
We convert the ts objects into a list of data.frames.
# as.yearmon() comes from the zoo package; time(x) uses each series' own time index
lst <- lapply(setNames(ts, paste0("series_", 1:10)), function(x)
data.frame(Y = as.matrix(x), date = as.Date(zoo::as.yearmon(time(x)))))
We now plot data by mapping id to the colour aesthetic to show the 10 different time series as 10 differently coloured line graphs.
library(tidyverse)
dplyr::bind_rows(lst, .id = "id") %>%
ggplot(aes(date, Y, colour = as.factor(id))) +
geom_line()

You need to reconsider your plot design.
There are only two years, so this can't be a classic time-series line chart.
library(tidyverse)
count(mpg, year)
year n
<int> <int>
1 1999 117
2 2008 117
One of the alternatives can be this
gg <- ggplot(mpg, aes(x=manufacturer, fill = as.factor(cyl)))
gg + geom_bar(stat = "count") +
facet_wrap(~year) +
theme(axis.text.x = element_text(angle = 90, hjust = 1))

Related

gganimate transition_reveal() with geom_line() breaking on the final frame?

I am trying to animate a line graph with multiple lines. It seems that there is an error with the gganimate package involving transition_reveal() that is causing the final frame to revert for all of the lines but one. This error is not present when not using gganimate. Here is the code:
library(ggplot2)
library(gganimate)
df <- read.csv("test.csv", stringsAsFactors = TRUE)
anim <- ggplot(df, aes(Day, Accidents, group = State, color = State)) +
geom_line() +
transition_reveal(Day) +
ease_aes('cubic-in-out')
jiff <- animate(anim, fps = 24, duration = 5, start_pause = 0, end_pause = 72, height = 4, width = 7, units = "in", res = 150)
jiff
Here is the dput of the dataframe:
structure(list(State = structure(c(1L, 2L, 3L, 4L, 1L, 2L, 3L,
4L, 1L, 2L, 3L, 4L), levels = c("A", "B", "C", "D"), class = "factor"),
Day = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L),
Accidents = c(5L, 2L, 5L, 6L, 1L, 2L, 6L, 8L, 4L, 10L, 2L,
4L)), class = "data.frame", row.names = c(NA, -12L))
Here is the output:
Regardless of the ending pause or how many values I have along the x-axis, the final frame will always look like this with only one line appearing as updated. Does anyone know why this might be happening?
UPDATE: Reverting the gganimate package from 1.0.8 to 1.0.7 did seem to do the trick after all.
The issue is in this part of the animate() call: start_pause = 0, end_pause = 72. Remove or adapt it:
anim <- ggplot(df, aes(Day, Accidents, group= State, color = State)) +
geom_line() +
transition_reveal(Day) +
ease_aes('cubic-in-out')
animate(anim, fps = 24, duration = 5,
height = 4, width = 7, units = "in", res = 150)

ggplot2 boxplots - How to group factors levels on the x-axis (and add reference lines for each group mean)

I have 30 plant species for which I have displayed the distributions of midday leaf water potential (lwp_md) using boxplots and the package ggplot2. But how do I group these species along the x-axis according to their leaf habits (e.g. Deciduous, Evergreen) as well as display a reference line indicating the mean lwp_md value for each leaf habit level?
I have attempted with the package forcats but really have no idea how to proceed with this one. I can't find anything after an extensive search online. The best I seem able to do is order species by some other function e.g. the median.
Below is an example of my code so far. Note I have used the packages ggplot2 and ggthemes:
library(ggplot2)
library(forcats)   # fct_reorder()
library(ggthemes)  # theme_few()
ggplot(zz, aes(x=fct_reorder(species, lwp_md, .fun=median, .desc=TRUE), y=lwp_md)) +
geom_boxplot(aes(fill=leaf_habit)) +
theme_few(base_size=14) +
theme(legend.position="top",
axis.text.x=element_text(size=8, angle=45, vjust=1, hjust =1)) +
xlab("Species") +
ylab("Maximum leaf water potential (MPa)") +
scale_y_reverse() +
scale_fill_discrete(name="Leaf habit",
breaks=c("DEC", "EG"),
labels=c("Deciduous", "Evergreen"))
Here's a subset of my data including 4 of my species (2 deciduous, 2 evergreen):
> dput(zz)
structure(list(id = 1:20, species = structure(c(1L, 1L, 1L, 1L,
1L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L
), .Label = c("AMYELE", "BURSIM", "CASXYL", "COLARB"), class = "factor"),
leaf_habit = structure(c(2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L,
1L, 1L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L), .Label = c("DEC",
"EG"), class = "factor"), lwp_md = c(-2.1, -2.5, -2.35, -2.6,
-2.45, -1.7, -1.55, -1.4, -1.55, -0.6, -2.6, -3.6, -2.9,
-3.1, -3.3, -2, -1.8, -2, -4.9, -5.35)), class = "data.frame", row.names = c(NA,
-20L))
An example of how I'm looking to display my data, cut and edited - I would like species on x-axis, lwp_md on y-axis:
ggplot2 places factor levels on the axis in level order, which is alphabetical by default. To avoid this, supply the factor with its levels in the order you want. This can be done by arranging the data.frame and then redeclaring the factor. To generate the mean value, we can use group_by and mutate to add a mean column to the data frame, which can then be plotted.
Here is the complete code:
library(ggplot2)
library(ggthemes)
library(dplyr)
zz2 <- zz %>% arrange(leaf_habit) %>% group_by(leaf_habit) %>% mutate(mean=mean(lwp_md))
zz2$species <- factor(zz2$species,levels=unique(zz2$species))
ggplot(zz2, aes(x=species, y=lwp_md)) +
geom_boxplot(aes(fill=leaf_habit)) +
theme_few(base_size=14) +
theme(legend.position="top",
axis.text.x=element_text(size=8, angle=45, vjust=1, hjust =1)) +
xlab("Species") +
ylab("Maximum leaf water potential (MPa)") +
scale_y_reverse() +
scale_fill_discrete(name="Leaf habit",
breaks=c("DEC", "EG"),
labels=c("Deciduous", "Evergreen")) +
geom_errorbar(aes(species, ymax = mean, ymin = mean),
size=0.5, linetype = "longdash", inherit.aes = F, width = 1)
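The reordering step can be seen in isolation on a tiny example. This is a base-R sketch under stated assumptions: the data frame toy and its values are made up, and the column habit_mean is a hypothetical name for the per-habit mean.

```r
# Hypothetical miniature of the question's data: one row per species
toy <- data.frame(
  species    = c("AMYELE", "BURSIM", "CASXYL", "COLARB"),
  leaf_habit = c("EG", "DEC", "EG", "DEC"),
  lwp_md     = c(-2.4, -1.4, -3.1, -3.2)
)

# Mean of lwp_md within each leaf habit, repeated on every row of that habit
toy$habit_mean <- ave(toy$lwp_md, toy$leaf_habit)

# Sort rows by leaf habit, then fix the level order to the order of appearance
toy <- toy[order(toy$leaf_habit, toy$species), ]
toy$species <- factor(toy$species, levels = unique(as.character(toy$species)))

levels(toy$species)
```

With the levels fixed this way, the deciduous species sit together on the x-axis, followed by the evergreen ones.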

Assigning tick marks frequency to discrete data axes in facet_grid

I'm having some trouble setting readable tick marks on my axes. The problem is that my data are at different magnitudes, so I'm not really sure how to go about it.
My data include ~400 different products, with three or four variables each, from two machines. I've pre-processed them into a data.table and used gather to convert them to long form; that part is fine.
Overview: Data is discrete, each X_________ on the x-axis represents a separate reading, and its relative values from machine 1/2 - the idea is to compare the two. The graphical format is perfect for my needs, I would just like to set the ticks at say, every 10 products on the x-axes, and at reasonable values on the y-axis.
Y_1: from 150 to 250
Y_2: from say, 1.5* to 2.5
Y_3: from say, 0.8* to 2.3
Y_4: from say, 0.4* to 1.5
*Bottom value, rounded down
Here's the code I'm using so far
var.Parameter <- c("Var1", "Var2", "Var3", "Var4")
MProduct$Parameter <- factor(MProduct$Parameter,
labels = var.Parameter)
labels_x <- MProduct$Lot[seq(0, 1626, by= 20)]
labels_y <- MProduct$Value[seq(0, 1626, by= 15)]
plot.MProduct <- ggplot(MProduct, aes(x = Lot,
y = Value,
colour = V4)) +
facet_grid(Parameter ~.,
scales = "free_y") +
scale_x_discrete(breaks=labels_x) +
scale_y_discrete(breaks=labels_y) +
geom_point() +
labs(title = "Product: Select Trends | 2018",
x = "Time (s)",
y = "Value") +
theme(axis.text.x = element_text (angle = 90,
hjust = 1,
vjust = 0.5))
# ggsave("MProduct.png")
plot.MProduct
Does anyone know how to make this graph more readable? Setting labels/breaks manually greatly limits flexibility and readability. There should be an option to set a tick at every Nth value, right? Same with y.
I need to apply this as a function to multiple datasets, so I'm not very happy about having to specify the column length of the "gathered" dataset every time either, which, in this case is 1626.
Since I'm here, I would also like to take the opportunity to ask about this code:
var.Parameter <- c("Var1", "Var2", "Var3", "Var4")
More often than not, I need to label my data in a specific order, which is not necessarily alphabetical. R, however, defaults to some kind of odd behaviour whereupon I have to plot and verify that the labels are indeed where they should be. Any clue how I could force them to be presented in order? As it is, my solution is to keep shifting their position in that line of code until it produces the graph correctly.
Many thanks.
Okay. I'm going to ignore the y axis labels because the defaults seem to work just fine as long as you don't try to overwrite them with your custom labels_y thing. Just let the defaults do their work. For the X axis, we'll give a couple options:
(A) Label every N products on the x-axis. Looking at ?scale_x_discrete, we can set the labels to a function that takes all the levels of the factor and returns the labels we want. So we'll write a function that returns a function that returns every Nth label:
every_n_labeler = function(n = 3) {
function (x) {
ind = ((1:length(x)) - 1) %% n == 0
x[!ind] = ""
return(x)
}
}
Now let's use that as the labeler:
ggplot(df, aes(x = Lot,
y = Value,
colour = Machine)) +
facet_grid(Parameter ~ .,
scales = "free_y") +
geom_point() +
scale_x_discrete(labels = every_n_labeler(3)) +
labs(title = "Product: Select Trends | 2018",
x = "Time (s)",
y = "Value") +
theme(axis.text.x = element_text (
angle = 90,
hjust = 1,
vjust = 0.5
))
You can change the every_n_labeler(3) to (10) to make it every 10th label.
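To see what the labeler does on its own, it can be called directly on a character vector. The sketch below repeats the definition so it runs standalone; the input labels X1 to X7 are made up for illustration.

```r
# Same idea as the labeler above: keep every nth label, blank out the rest
every_n_labeler <- function(n = 3) {
  function(x) {
    keep <- (seq_along(x) - 1) %% n == 0
    x[!keep] <- ""
    x
  }
}

# Hypothetical labels, just to inspect the result
labels_out <- every_n_labeler(3)(paste0("X", 1:7))
labels_out
# "X1" ""   ""   "X4" ""   ""   "X7"
```

The blanked positions still get a tick mark on the axis; only the label text is suppressed.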
(B) Maybe more appropriate: it seems like your x-axis is actually numeric and just happens to have an "X" in front of it. Let's convert it to numeric and let the defaults do the labeling work:
df$time = as.numeric(gsub(pattern = "X", replacement = "", x = df$Lot))
ggplot(df, aes(x = time,
y = Value,
colour = Machine)) +
facet_grid(Parameter ~ .,
scales = "free_y") +
geom_point() +
labs(title = "Product: Select Trends | 2018",
x = "Time (s)",
y = "Value") +
theme(axis.text.x = element_text (
angle = 90,
hjust = 1,
vjust = 0.5
))
With your full x range, I imagine that would look nice.
(C) But who wants to read those 9-digit numbers? You're labeling the x-axis "Time (s)", which makes me think it's actually a time, measured in seconds from some start time. I'll make up that your start time is 2010-01-01 and convert these seconds to actual times, and then we get a nice date-time scale:
ggplot(df, aes(x = as.POSIXct(time, origin = "2010-01-01"),
y = Value,
colour = Machine)) +
facet_grid(Parameter ~ .,
scales = "free_y") +
geom_point() +
labs(title = "Product: Select Trends | 2018",
x = "Time (s)",
y = "Value") +
theme(axis.text.x = element_text (
angle = 90,
hjust = 1,
vjust = 0.5
))
If this is the real meaning behind your data, then using a date-time axis is a big step up for readability. (Again, notice that we are not specifying the breaks, the defaults work quite well.)
Using this data (I subset your sample data down to 2 facets and used dput to make it copy/pasteable):
df = structure(list(Lot = structure(c(1L, 2L, 3L, 4L, 1L, 2L, 3L,
4L, 1L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 1L, 2L, 3L, 4L, 1L,
2L, 3L, 4L, 1L), .Label = c("X180106482", "X180126485", "X180306523",
"X180526326"), class = "factor"), Value = c(201, 156, 253, 211,
178, 202.5, 203.4, 204.3, 205.2, 2.02, 2.17, 1.23, 1.28, 1.54,
1.28, 1.45, 1.61, 2.35, 1.34, 1.36, 1.67, 2.01, 2.06, 2.07, 2.19,
1.44, 2.19), Parameter = structure(c(1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L), .Label = c("Var 1", "Var 2", "Var 3", "Var 4"
), class = "factor"), Machine = structure(c(2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L), .Label = c("Machine 1", "Machine 2"), class = "factor"),
time = c(180106482, 180126485, 180306523, 180526326, 180106482,
180126485, 180306523, 180526326, 180106482, 180106482, 180126485,
180306523, 180526326, 180106482, 180126485, 180306523, 180526326,
180106482, 180106482, 180126485, 180306523, 180526326, 180106482,
180126485, 180306523, 180526326, 180106482)), row.names = c(NA,
-27L), class = "data.frame")
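The conversions behind options (B) and (C) can be checked without plotting. This is a sketch under assumptions: the three lot identifiers are taken from the sample data, the origin 2010-01-01 is the made-up start time from the answer, and tz = "UTC" is an extra assumption added here so the result does not depend on the local time zone.

```r
# Lot identifiers as in the sample data: an "X" followed by digits
lot <- factor(c("X180106482", "X180126485", "X180306523"))

# Option (B): strip the leading "X" and convert to numeric
time_s <- as.numeric(sub("^X", "", as.character(lot)))

# Option (C): treat the numbers as seconds since an assumed origin
time_dt <- as.POSIXct(time_s, origin = "2010-01-01", tz = "UTC")
```

If the numbers are not really seconds, only the numeric conversion in (B) applies.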

Center Labels in Filled Bar Chart using geom_text

I am new to ggplot2 (and R) and am trying to make a filled bar chart with labels in each box indicating the percentage composing that block.
Here is an example of my current figure to which I would like to add labels:
##ggplot figure
library(ggplot2)
library(scales)
#specify order I want in plots
ZIU$Affinity=factor(ZIU$Affinity, levels=c("High", "Het", "Low"))
ZIU$Group=factor(ZIU$Group, levels=c("ZUM", "ZUF", "ZIM", "ZIF"))
ggplot(ZIU, aes(x=Group))+
geom_bar(aes(fill=Affinity), position="fill", width=1, color="black")+
scale_y_continuous(labels=percent_format())+
scale_fill_manual("Affinity", values=c("High"="blue", "Het"="lightblue", "Low"="gray"))+
labs(x="Group", y="Percent Genotype within Group")+
ggtitle("Genotype Distribution", "by Group")
I would like to add labels centered in each box with the percentage that box represents
I have tried to add labels using this code, but it keeps producing the error message "Error: geom_text requires the following missing aesthetics: y" but my plot has no y aesthetic, does this mean I cannot use geom_text? (Also, I am not sure if once the y aesthetic issue is resolved, if the remainder of the geom_text statement will accomplish what I desire, centered white labels in each box.)
ggplot(ZIU, aes(x=Group)) +
geom_bar(aes(fill=Affinity), position="fill", width=1, color="black")+
geom_text(aes(label=paste0(sprintf("%.0f", ZIU$Affinity),"%")),
position=position_fill(vjust=0.5), color="white")+
scale_y_continuous(labels=percent_format())+
scale_fill_manual("Affinity", values=c("High"="blue", "Het"="lightblue", "Low"="gray"))+
labs(x="Group", y="Percent Genotype within Group")+
ggtitle("Genotype Distribution", "by Group")
Also if anyone has suggestions for eliminating the NA values that would be appreciated! I tried
geom_bar(aes(fill=na.omit(Affinity)), position="fill", width=1, color="black")
but was getting the error "Error: Aesthetics must be either length 1 or the same as the data (403): fill, x"
dput(sample)
structure(list(Group = structure(c(3L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L), .Label = c("ZUM", "ZUF", "ZIM", "ZIF"), class = "factor"),
StudyCode = c(1, 2, 3, 4, 5, 6, 20, 21, 22, 23, 143, 144,
145, 191, 192, 193, 194, 195, 196, 197, 10, 24, 25, 26, 27,
28, 71, 72, 73, 74, 274, 275, 276, 277, 278, 279, 280, 290,
291, 292), Affinity = structure(c(3L, 2L, 1L, 2L, 3L, 1L,
1L, 2L, 2L, 2L, 2L, 2L, 3L, 2L, 3L, 2L, 3L, 1L, 1L, 1L, 3L,
2L, 1L, 2L, 2L, 1L, 2L, 2L, 3L, 3L, 2L, 1L, 3L, 2L, 1L, 3L,
3L, 2L, 2L, 2L), .Label = c("High", "Het", "Low"), class = "factor")), .Names = c("Group",
"StudyCode", "Affinity"), row.names = c(NA, 40L), class = c("tbl_df",
"tbl", "data.frame"))
Thank you so much!
The linked examples have a y aesthetic, because the data are pre-summarized, rather than having ggplot do the counting internally. With your data, the analogous approach would be:
library(scales)
library(tidyverse)
# Summarize data to get counts and percentages
ZIU %>% group_by(Group, Affinity) %>%
tally %>%
mutate(percent=n/sum(n)) %>% # Pipe summarized data into ggplot
ggplot(aes(x=Group, y=percent, fill=Affinity)) +
geom_bar(stat="identity", width=1, color="black") +
geom_text(aes(label=paste0(sprintf("%1.1f", percent*100),"%")),
position=position_stack(vjust=0.5), colour="white") +
scale_y_continuous(labels=percent_format()) +
scale_fill_manual("Affinity", values=c("High"="blue", "Het"="lightblue", "Low"="gray")) +
labs(x="Group", y="Percent Genotype within Group") +
ggtitle("Genotype Distribution", "by Group")
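The pre-summarizing step is what supplies the y aesthetic that geom_text needs. The same within-group percentages can be computed in base R; the sketch below assumes a small made-up data frame (toy is hypothetical, with the question's column names) and is an alternative to the dplyr pipeline, not a replacement for it.

```r
# Hypothetical toy data with the question's column names
toy <- data.frame(
  Group    = c("ZUM", "ZUM", "ZUM", "ZUF", "ZUF", "ZUF", "ZUF"),
  Affinity = c("High", "Het", "Het", "Low", "Low", "High", "Het")
)

# Count every Group x Affinity combination (columns: Group, Affinity, Freq)
counts <- as.data.frame(table(Group = toy$Group, Affinity = toy$Affinity))

# Share of each Affinity within its Group; shares sum to 1 per Group
counts$percent <- counts$Freq / ave(counts$Freq, counts$Group, FUN = sum)
```

Unlike the dplyr version, table() also returns combinations with a count of zero, which may need filtering out before labels are drawn.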
Another option would be to use a line plot, which might make the relative values more clear. Assuming the Group values don't form a natural sequence, the lines are just there as a guide for differentiating the Affinity values across different values of Group.
ZIU %>% group_by(Group, Affinity) %>%
tally %>%
mutate(percent=n/sum(n)) %>% # Pipe summarized data into ggplot
ggplot(aes(x=Group, y=percent, colour=Affinity, group=Affinity)) +
geom_line(alpha=0.4) +
geom_text(aes(label=paste0(sprintf("%1.1f", percent*100),"%")), show.legend=FALSE) +
scale_y_continuous(labels=percent_format(), limits=c(0,1)) +
labs(x="Group", y="Percent Genotype within Group") +
ggtitle("Genotype Distribution", "by Group") +
guides(colour=guide_legend(override.aes=list(alpha=1, size=1))) +
theme_classic()

How to ggplot two groups of income-segment populations and values

I have a data frame which has two types of 'groups,' the densities of which I would like to overlay on the same graph.
using ggplot, I tried to graph the density using the following two lines of code:
full$group <- factor(full$group)
ggplot(full, aes(x=income, fill=group)) + geom_density()
The issue with this is that it does not take the frequency variable (freq) into account, and simply calculates the frequency itself. That is a problem because there is exactly one row for every income-group combination.
I believe I have two options, each of which has a question:
a) Should I plot the graph using the way the data is currently formatted? If so, how would I do that?
b) Should I reformat the data to make the frequency of each group/income combination equivalent to the freq variable assigned to it? If so, how would I do that?
This is the kind of graph I would like, where "income" = "rating" and "group" = "cond":
dput of 'full':
full <- structure(list(income = c(10000, 19000, 29000, 39000, 49000, 75000, 99000, 1e+05, 10000, 19000,29000, 39000, 49000, 75000, 99000, 1e+05),
group = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("one", "two"), class = "factor"),
freq = c(1237, 1791, 743, 291, 256, 212, 29, 11, 921, 1512, 614, 301, 209, 223, 48, 1)), .Names = c("income", "group", "freq"),
row.names = c(NA, 16L), class = "data.frame")
You can repeat the observations by their frequency with
ggplot(full[rep(1:nrow(full), full$freq),]) +
geom_density(aes(x=income, fill=group), color="black", alpha=.75, adjust=4)
Of course with your data this produces a pretty lousy plot
When estimating a density, your data should be observations from a continuous distribution. Here you really have a discrete distribution with repeated observations (in a true continuous distribution, the probability of seeing any value more than once is 0).
You could try to smooth this curve by setting the adjust= parameter to a number >1, (like 3 or 4). But really, your input data is just not in an appropriate form for a density plot. A bar plot would be a better choice. Maybe something like
ggplot(full, aes(as.factor(income), freq, fill=group)) +
geom_bar(stat="identity", position="dodge")
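The row-repetition idiom used in the first snippet of this answer is easier to follow on a tiny example. A minimal sketch, assuming a made-up data frame toy with the same column names as full:

```r
# Hypothetical frequency table: one row per income/group combination
toy <- data.frame(
  income = c(10000, 19000, 29000),
  group  = c("one", "one", "two"),
  freq   = c(2, 1, 3)
)

# Repeat row i freq[i] times, giving one row per underlying observation
expanded <- toy[rep(seq_len(nrow(toy)), toy$freq), ]

nrow(expanded)  # equals sum(toy$freq)
```

The expanded data frame grows with the total frequency, so for very large counts the bar plot on the original table is also the cheaper choice.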
