ggplot geom_bar where x = multiple columns - r

How can I go about making a bar plot where the X comes from multiple values of a data frame?
Fake data:
data <- data.frame(col1 = rep(c("A", "B", "C", "B", "C", "A", "A", "B", "B", "A", "C")),
col2 = rep(c(2012, 2012, 2012, 2013, 2013, 2014, 2014, 2014, 2015, 2015, 2015)),
col3 = rep(c("Up", "Down", "Up", "Up", "Down", "Left", "Right", "Up", "Right", "Down", "Up")),
col4 = rep(c("Y", "N", "N", "N", "Y", "N", "Y", "Y", "Y", "N", "Y")))
What I'm trying to do is plot the number (and, ideally, the percentage) of Y's and N's in col4, grouped by col1, col2, and col3.
Overall, if there are 50 rows and 25 of the rows have Y's, I should be able to make a graph that looks like this:
I know a basic barplot with ggplot is:
ggplot(data, aes(x = col1, fill = col4)) + geom_bar()
I'm not looking for how many of col4 is found per col3 by col2, though, so facet_wrap() isn't the trick, I think, but I don't know what to do instead.

You need to first convert your data frame into a long format, and then use the created variable to set the facet_wrap().
data_long <- tidyr::gather(data, key = type_col, value = categories, -col4)
ggplot(data_long, aes(x = categories, fill = col4)) +
geom_bar() +
facet_wrap(~ type_col, scales = "free_x")
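For what it's worth, gather() has since been superseded by pivot_longer() in tidyr. Here is a minimal sketch of the same reshape, assuming tidyr 1.0 or later and the data frame from the question. Converting every column to character up front is my own addition, so that the numeric year and the text columns can share one value column:

```r
library(tidyr)
library(ggplot2)

# Data frame from the question
data <- data.frame(
  col1 = c("A", "B", "C", "B", "C", "A", "A", "B", "B", "A", "C"),
  col2 = c(2012, 2012, 2012, 2013, 2013, 2014, 2014, 2014, 2015, 2015, 2015),
  col3 = c("Up", "Down", "Up", "Up", "Down", "Left", "Right", "Up", "Right", "Down", "Up"),
  col4 = c("Y", "N", "N", "N", "Y", "N", "Y", "Y", "Y", "N", "Y")
)

# Assumption: make all columns character so they can be stacked together
data_chr <- data
data_chr[] <- lapply(data_chr, as.character)

# One row per (original column, value) pair; col4 is kept as-is
data_long <- pivot_longer(data_chr, cols = -col4,
                          names_to = "type_col", values_to = "categories")

p <- ggplot(data_long, aes(x = categories, fill = col4)) +
  geom_bar() +
  facet_wrap(~ type_col, scales = "free_x")
```

Printing p should give one facet per original column, with the Y/N counts stacked in each bar.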

A very rough approximation, hoping it'll spark conversation and/or give enough to start.
Your data is too small to do much, so I'll extend it.
set.seed(2)
n <- 100
d <- data.frame(
cat1 = sample(c('A','B','C'), size=n, replace=TRUE),
cat2 = sample(c(2012L,2013L,2014L,2015L), size=n, replace=TRUE),
cat3 = sample(c('^','v','<','>'), size=n, replace=TRUE),
val = sample(c('X','Y'), size=n, replace=TRUE)
)
I'm using dplyr and tidyr here to reshape the data a little:
library(ggplot2)
library(dplyr)
library(tidyr)
d %>%
tidyr::gather(cattype, cat, -val) %>%
filter(val=="Y") %>%
head
# Warning: attributes are not identical across measure variables; they will be dropped
# val cattype cat
# 1 Y cat1 A
# 2 Y cat1 A
# 3 Y cat1 C
# 4 Y cat1 C
# 5 Y cat1 B
# 6 Y cat1 C
The next trick is to facet it correctly:
d %>%
tidyr::gather(cattype, cat, -val) %>%
filter(val=="Y") %>%
ggplot(aes(val, fill=cattype)) +
geom_bar() +
facet_wrap(~cattype+cat, nrow=1)

Depending on what you want here, you can also achieve something like what you want using melt from the reshape package.
(NOTE: this solution is very similar to Phil's, and you could convert it to be just like his if you made col4 your fill instead, didn't filter by only "Y"s, and included a facet wrap)
Following on from your data setup:
library(reshape)
library(ggplot2)
#Reshape the data to sort it by all the other column's categories
data$col2 <- as.factor(as.character(data$col2))
breakdown <- melt(data, "col4")
#Our x values are the individual values, e.g. A, 2012, Down.
#Our fill is what we want it grouped by, in this case variable, which is our col1, col2, col3 (default column name from melt)
ggplot(subset(breakdown, col4 == "Y"), aes(x = value, fill = variable)) +
geom_bar() +
# scale_x_discrete(drop=FALSE) +
scale_fill_discrete(labels = c("Letters", "Year", "Direction")) +
ylab("Number of Yes's")
I'm not 100% sure what you want, but perhaps this is more like it?
EDIT
To show percentages of Yes's instead we can use ddply from the plyr package to create a data frame which has each of the variables with their yes percentages, then make the barplot plot a value rather than a count.
#The ddply applies a function to a data frame grouped by columns.
#In this case we group by our col1, col2 and col3 as well as the value.
#The function I apply just calculates the percentage, i.e. number of yeses/number of responses
library(plyr)
plot_breakdown <- ddply(breakdown, c("variable", "value"), function(x){sum(x$col4 == "Y")/nrow(x)})
#When we plot we now add y = V1 to plot the percentage response
#Also in geom_bar I've now added stat = 'identity' so it doesn't try and plot counts
ggplot(plot_breakdown, aes(x = value, y = V1, fill = variable)) +
geom_bar(aes(group = factor(variable)), position = "dodge", stat = 'identity') +
scale_x_discrete(drop=FALSE) +
scale_fill_discrete(labels = c("Letters", "Year", "Direction")) +
ylab("Percentage of Yes's") +
scale_y_continuous(limits = c(0,1), breaks = seq(0,1,0.25), labels = c("0%", "25%", "50%", "75%", "100%"))
The last line I've added to the ggplot is to just make the y axis look a bit more percentage-y :)
In the comments you've mentioned you want to do this because the sample sizes are different and you want a fair comparison between categories. My advice is to be careful here. Percentages look good, but they can really mislead if sample sizes are small. To say 0% answered yes when you only got one response is heavily biased, for example. I would either exclude columns with what you deem too small a sample size, or take advantage of the colour field.
#Adding an extra column using ddply again which generates a 1 if the sample size is less than 3, and a 0 otherwise
plot_breakdown <- cbind(plot_breakdown,
too_small = factor(ddply(breakdown, c("variable", "value"), function(x){ifelse(nrow(x)<3,1,0)})[,3]))
#Same ggplot as before, except with a colour variable now too (outside line of bar)
#Because of this I also added a way to customise the colours which display, and the names of the colour legend
ggplot(plot_breakdown, aes(x = value, y = V1, fill = variable, colour = too_small)) +
geom_bar(size = 2, position = "dodge", stat = 'identity') +
scale_x_discrete(drop=FALSE) +
labs(fill = "Variable", colour = "Too small?") +
scale_fill_discrete(labels = c("Letters", "Year", "Direction")) +
scale_colour_manual(values = c("black", "red"), labels = c("3+ response", "< 3 responses")) +
ylab("Percentage of Yes's") +
scale_y_continuous(limits = c(0,1), breaks = seq(0,1,0.25), labels = c("0%", "25%", "50%", "75%", "100%"))
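If you would rather not load plyr, roughly the same summary can be sketched with dplyr and tidyr. This assumes dplyr 1.0 or later (for across() and .groups). The names yes_summary, n, pct_yes and too_small are just illustrative choices, not part of the answer above:

```r
library(dplyr)
library(tidyr)

# Data frame from the question
data <- data.frame(
  col1 = c("A", "B", "C", "B", "C", "A", "A", "B", "B", "A", "C"),
  col2 = c(2012, 2012, 2012, 2013, 2013, 2014, 2014, 2014, 2015, 2015, 2015),
  col3 = c("Up", "Down", "Up", "Up", "Down", "Left", "Right", "Up", "Right", "Down", "Up"),
  col4 = c("Y", "N", "N", "N", "Y", "N", "Y", "Y", "Y", "N", "Y")
)

yes_summary <- data %>%
  # Make the grouping columns the same type so they can be stacked
  mutate(across(-col4, as.character)) %>%
  pivot_longer(-col4, names_to = "variable", values_to = "value") %>%
  group_by(variable, value) %>%
  summarise(
    n = n(),                      # number of responses in this category
    pct_yes = mean(col4 == "Y"),  # share of those responses that are "Y"
    .groups = "drop"
  ) %>%
  # Flag categories with fewer than 3 responses, as in the answer above
  mutate(too_small = n < 3)
```

yes_summary can then be passed to ggplot() with y = pct_yes and colour = too_small in place of plot_breakdown.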

If you actually group your Y's and N's by the other three columns, there will be one observation in each group. However, if you had repeated Y's and N's you could recode them to 1's and 0's, and get the percentage. Here's an example:
library(tidyverse)
data <- data.frame(col1 = rep(c("A", "B", "C", "B", "C", "A", "A", "B", "B", "A", "C")),
col2 = rep(c(2012, 2012, 2012, 2013, 2013, 2014, 2014, 2014, 2015, 2015, 2015)),
col3 = rep(c("Up", "Down", "Up", "Up", "Down", "Left", "Right", "Up", "Right", "Down", "Up")),
col4 = rep(c("Y", "N", "N", "N", "Y", "N", "Y", "Y", "Y", "N", "Y")))
data %>%
dplyr::group_by(col1,col2,col3) %>%
mutate(col4 = ifelse(col4 == "Y", 1,0)) %>%
dplyr::summarise(percentage = mean(col4)) %>%
ggplot(aes(x = col1, y = percentage, color = as.factor(col2), fill = col3)) +
geom_col(position = position_dodge(width = .5))

Related

Barplot with sorted dots

I want to plot a 4 group barplot from a first data-frame called df1 and display dots from another data-frame called df2. The idea is to check how many dots from df2 lie outside of df1.
So I made the following graph which works well.
### 0- Import packages
library(dplyr)
library(ggplot2)
### 1- Data simulation
set.seed(4)
df1 <- data.frame(var=c("a", "b", "c", "d"), value=c(15, 19, 18, 17))
df2 <- data.frame(var1=rep(c("a", "b", "c", "d"), each=20), value=rnorm(80, 15, 2), color=NA, fill=NA)
### 2- Coloring data (outside=red, inside=blue)
df2$fill <- case_when(
(df2$var1=="a" & df2$value>subset(df1, var=='a')$value) ~ "#e18b8b",
(df2$var1=="b" & df2$value>subset(df1, var=='b')$value) ~ "#e18b8b",
(df2$var1=="c" & df2$value>subset(df1, var=='c')$value) ~ "#e18b8b",
(df2$var1=="d" & df2$value>subset(df1, var=='d')$value) ~ "#e18b8b",
TRUE ~ "#8cbee2")
df2$color <- case_when(
(df2$var1=="a" & df2$value>subset(df1, var=='a')$value) ~ "#ca0d0d",
(df2$var1=="b" & df2$value>subset(df1, var=='b')$value) ~ "#ca0d0d",
(df2$var1=="c" & df2$value>subset(df1, var=='c')$value) ~ "#ca0d0d",
(df2$var1=="d" & df2$value>subset(df1, var=='d')$value) ~ "#ca0d0d",
TRUE ~ "#0c78ca")
### 3- Display plot
ggplot(aes(x=var, y=value), data=df1) + geom_bar(stat="identity", fill='#8cbee2', width=0.6) +
geom_point(data=df2, aes(x=var1, y=value), colour=df2$color, fill=df2$fill, position=position_jitter(width=0.05, height=0), shape=21, size=2)
In order to improve this graph, I would like to order dots from df2 displayed within each barplot group, kind of qqplot-shaped.
-First, this would make it possible to tell whether the number of dots outside is large compared to the bars.
-Second, this would show the distribution of the inside and outside dots.
I have found the following link but it only deals with one data-frame and I am working with 2.
How to plot boxplots superimposed with sorted points using ggplot2
Do you have any clue on how to sort these dots?
EDIT
Result following Stephan's answer
One option to achieve your desired result would be to use position_dodge and a helper column. To this end first order your data by var1 and value, then add the helper column as an interaction of var1 and the row index or number. This helper column could then be mapped on the group aes to ensure that points are plotted in ascending order where the dodge gives the qqplot-like shape:
Note: I also used a different approach for the colors which uses a left_join and maps on the color and fill aes.
library(dplyr)
set.seed(4)
df1 <- data.frame(var = c("a", "b", "c", "d"), value = c(15, 19, 18, 17))
df2 <- data.frame(var1 = rep(c("a", "b", "c", "d"), each = 20), value = rnorm(80, 15, 2), color = NA, fill = NA)
df2 <- df2 %>%
left_join(df1, by = c("var1" = "var"), suffix = c("", "_df1")) %>%
arrange(var1, value) %>%
mutate(
var_dodge = interaction(var1, row_number()),
color = value > value_df1
)
library(ggplot2)
ggplot(aes(x = var, y = value), data = df1) +
geom_bar(stat = "identity", fill = "#8cbee2", width = 0.6) +
geom_point(
data = df2, aes(x = var1, y = value, group = var_dodge, color = color, fill = color),
position = position_dodge(width = .4), shape = 21, size = 2
) +
scale_color_manual(values = c("TRUE" = "#ca0d0d", "FALSE" = "#0c78ca")) +
scale_fill_manual(values = c("TRUE" = "#e18b8b", "FALSE" = "#8cbee2")) +
guides(fill = "none", color = "none")
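Since the stated goal is to check how many dots from df2 lie outside the bars from df1, the same left_join also gives those counts directly. This is a small sketch reusing the simulated data, and n_outside and n_total are names I made up:

```r
library(dplyr)

set.seed(4)
df1 <- data.frame(var = c("a", "b", "c", "d"), value = c(15, 19, 18, 17))
df2 <- data.frame(var1 = rep(c("a", "b", "c", "d"), each = 20),
                  value = rnorm(80, 15, 2))

outside_counts <- df2 %>%
  # Attach the bar height of the matching group as value_df1
  left_join(df1, by = c("var1" = "var"), suffix = c("", "_df1")) %>%
  group_by(var1) %>%
  summarise(
    n_outside = sum(value > value_df1),  # dots above the bar
    n_total = n(),                       # all dots in the group
    .groups = "drop"
  )
```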

How can I know the correct order of the datasets when using the set_names function?

Recently I plotted a boxplot with 3 different datasets.
The plot is fine. But I passed the names to set_names in this order: "S", "M", and "E", and the order in the plot is not the same.
Here is the code:
df <-
list(df_1v, df_2v, df_3v) %>%
set_names(c("S", "M", "E")) %>%
map_dfr(bind_rows, .id = "df") %>%
pivot_longer(-df)
So the names passed to set_names are in the same order as the data frames in the list.
However, this is the plot:
This plot shows the inverted order: "E", "M", and "S".
How can I know if the data is in the correct order without seeing each value of the data frame (the data is enormous)?
Is there a function to check the exact order?
In case you need it, here is the code for the boxplot:
ggplot(df)+
geom_boxplot(aes(x = name, y = value),
fill = "blue",
color = "blue",
alpha = 0.2,
notch = T,
notchwidth = 0.8)+
facet_wrap(~df, nrow = 1)
You can try this code -
library(tidyverse)
list(df_1v, df_2v, df_3v) %>%
set_names(c("S", "M", "E")) %>%
map_dfr(bind_rows, .id = "df") %>%
pivot_longer(-df) %>%
mutate(df = factor(df, unique(df))) %>%
ggplot() +
geom_boxplot(aes(x = name, y = value),
fill = "blue",
color = "blue",
alpha = 0.2,
notch = T,
notchwidth = 0.8) +
facet_wrap(~df, nrow = 1)
The order of the plots is controlled by the levels of the factor variable in the data. With factor(df, unique(df)) the levels are assigned in order of first occurrence in the data, so we get the order we specified in set_names, i.e. c("S", "M", "E").
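To check the order without looking through the values, levels() on the factor is enough. Here is a tiny base R sketch with a made-up vector standing in for the df column:

```r
# Hypothetical stand-in for the `df` id column produced by map_dfr()
df_col <- c("S", "S", "M", "M", "E", "E")

# Default: levels are sorted, which is the order facet_wrap() follows
default_levels <- levels(factor(df_col))

# Levels in order of first appearance, as in the answer above
ordered_levels <- levels(factor(df_col, levels = unique(df_col)))

print(default_levels)  # "E" "M" "S"
print(ordered_levels)  # "S" "M" "E"
```

A plain character column is converted to a factor with sorted levels when faceting, which is why the panels came out as E, M, S.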

R Stacked percentage bar plot with percentage of two factor variables with ggplot

I am trying to plot two factor variables and label the results with % inside the plots.
I already checked this post and the links he/she provides :
How to center stacked percent barchart labels
The ggplot line you are seeing here is actually from one of the recommended posts:
sex <- c("F","F","M", "M", "M", "F","M","F","F", "M", "M", "M", "M","F","F", "M", "M", "F")
behavior <- c("A", "B", "C", "A", "B", "C", "A", "B", "C", "A", "B", "C", "A", "B", "C", "B", "C", "A")
BehSex <- data.frame(sex, behavior)
ggplot(BehSex, aes(x= factor(sex), fill= factor(behavior), y = (..count..)/sum(..count..)))+
geom_bar() +
stat_bin(geom = "text",
aes(label = paste(round((..count..)/sum(..count..)*100), "%")),
vjust = 5)
However, when I use that line I get the following error :
Error: StatBin requires a continuous x variable: the x variable is
discrete. Perhaps you want stat="count"?
I tried using stat="count" inside the geom_bar() but it doesn't seem to work as expected.
Three questions:
1) What am I doing wrong?
2) How can I manage to plot what I want?
3) How can I plot: the % and then in another graph the counts?
Here's the plot that I have right now
Thank you in advance for your help!
Following the answer in the post you mentioned, you have to place the percentage labels using position = position_stack().
You can also use the dplyr package to compute the percentages from your data frame first. In my opinion, that makes the labelling easier:
library(dplyr)
df <- BehSex %>% group_by(sex) %>% count(behavior) %>% mutate(Percent = n / sum(n)*100)
# A tibble: 6 x 4
# Groups: sex [2]
#   sex   behavior     n Percent
#   <fct> <fct>    <int>   <dbl>
# 1 F     A            2    25
# 2 F     B            3    37.5
# 3 F     C            3    37.5
# 4 M     A            4    40
# 5 M     B            3    30
# 6 M     C            3    30
Then, you can get your plot like this:
ggplot(df, aes(x = sex, y = Percent, fill = behavior))+
geom_bar(stat = "identity")+
geom_text(aes(label = paste(Percent,"%"), y = Percent),
position = position_stack(vjust = 0.5))+
coord_flip()+
labs(x = "Sex", y = "Percentage",fill = "Behavior")
Here's another approach using a bit of data prep with dplyr:
EDIT: added counts. To show one or the other just change the label.
library(dplyr)
BehSexSum <- BehSex %>%
count(sex, behavior) %>%
mutate(pct = n / sum(n),
pct_label = scales::percent(pct))
ggplot(BehSexSum, aes(x= sex, fill = behavior, y = pct)) +
geom_col() +
geom_text(aes(label = paste(pct_label, n, sep = "\n")),
lineheight = 0.8,
position = position_stack(vjust = 0.5)) +
scale_y_continuous(labels = scales::percent)
I think an easier approach to format the y-axis labels as percentage is using scale_y_continuous(labels = scales::percent), instead of using stat_bin(...). Therefore, the code can stay almost the same.
ggplot(BehSex, aes(x= factor(sex), fill= factor(behavior), y =(..count..)/sum(..count..)))+
geom_bar() +
#Set the y axis format as percentage
scale_y_continuous(labels = scales::percent)+
#Change the legend and axes names
labs(x = "Sex", y = "Percentage",fill = "Behavior")

How to point each plot to correct y axis (many plots, two y axes, in R with ggplot2)

So I have compared two groups with a third using a range of inputs. For each of the three groups I have a value and a confidence interval for a range of inputs. For the two comparisons I also have a p-value for that range of inputs. Now I would like to plot all five data series, but use a second axis for the p values.
I am able to do that except for one thing: how do I make sure that R knows which of the plots to assign to the second axis?
This is what it looks like now. The bottom two data series should be scaled up to the Y axis to the right.
ggplot(df) +
geom_pointrange(aes(x=x, ymin=minc, ymax=maxc, y=meanc, color="c")) +
geom_pointrange(aes(x=x, ymin=minb, ymax=maxb, y=meanb, color="b")) +
geom_pointrange(aes(x=x, ymin=mina, ymax=maxa, y=meana, color="a")) +
geom_point(aes(x=x, y=c, color="c")) +
geom_point(aes(x=x, y=b, color="b")) +
scale_y_continuous(sec.axis = sec_axis(~.*0.2))
df is a dataframe whose column names are all the variables you see listed above, all row values are the corresponding datapoints.
You can get what you want, staying true to Hadley's canon and the Grammar of Graphics gospel, if you transform your data frame from wide to long and use a different aes (i.e. shape, color, fill) for the means and the CI.
You did not provide a reproducible example, so I employ my own. (Dput at the end of the post)
library(dplyr)
library(ggplot2)
df2 <- df %>%
mutate(CatCI = if_else(is.na(CI), "", Cat)) # Create a categorical name to map the CI to the legend.
ggplot(df2, aes(x = x)) +
geom_pointrange(aes(ymin = min, ymax = max, y = mean, color = Cat), shape = 16) +
geom_point(data = dplyr::filter(df2,!is.na(CI)), ## Filter the NA within the CI
aes(y = (CI/0.2), ## Transform the CI's y position to fit the right axis.
fill = CatCI), ## Map a second aesthetic (fill) so it gets its own legend
shape = 25, size = 5, alpha = 0.25 ) + ## I changed shape, size, and fill to help with visualization
scale_y_continuous(sec.axis = sec_axis(~.*0.2, name = "P Value")) +
labs(color = "Linerange\nSinister Axis", fill = "P value\nDexter Axis", y = "Mean")
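The division by 0.2 in the code above mirrors the multiplication in sec_axis(). ggplot2 does not assign a layer to one axis or the other: the secondary axis is only a relabelling of the primary one, so the values meant for it have to be transformed with the inverse of the sec_axis() formula before plotting. Here is a small base R sketch of that round trip, with made-up p-values:

```r
# Hypothetical p-values, chosen only for illustration
p_value <- c(0.16, 0.18, 0.20)

# The factor used in sec_axis(~ . * 0.2)
scale_factor <- 0.2

# Position on the primary (left) axis: inverse of the sec_axis() transform
y_plotted <- p_value / scale_factor

# What the secondary (right) axis labels show for those positions
y_right_axis <- y_plotted * scale_factor

# The right-axis reading recovers the original p-values
stopifnot(isTRUE(all.equal(y_right_axis, p_value)))
```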
Result:
Dataframe:
df <- structure(list(Cat = c("a", "b", "c", "a", "b", "c", "a", "b",
"c", "a", "b", "c", "a", "b", "c"), x = c(2, 2, 2, 2.20689655172414,
2.20689655172414, 2.20689655172414, 2.41379310344828, 2.41379310344828,
2.41379310344828, 2.62068965517241, 2.62068965517241, 2.62068965517241,
2.82758620689655, 2.82758620689655, 2.82758620689655), mean = c(0.753611797661977,
0.772340941644911, 0.793970086962944, 0.822424652072316, 0.837015408776649,
0.861417383841253, 0.87023105762465, 0.892894201949377, 0.930096326498796,
0.960862178366363, 0.966600321596147, 0.991206984637544, 1.00714201832596,
1.02025006679944, 1.03650896186786), max = c(0.869753641121797,
0.928067675294351, 0.802815304215019, 0.884750162053761, 1.03609814491961,
0.955909854315582, 1.07113399603486, 1.02170928767791, 1.05504846273091,
1.09491706586801, 1.20235615364205, 1.12035782960649, 1.17387406039167,
1.13909154635088, 1.0581878034897), min = c(0.632638511783381,
0.713943701135991, 0.745868763626567, 0.797491261486603, 0.743382797144923,
0.827693203320894, 0.793417962991821, 0.796917421637021, 0.92942504556723,
0.89124101157585, 0.813058838839382, 0.91701749675892, 0.943744642652422,
0.912869230576973, 0.951734254896252), CI = c(NA, 0.164201137643034,
0.154868406784159, NA, 0.177948094206453, 0.178360305763648,
NA, 0.181862670931493, 0.198447350829814, NA, 0.201541499248143,
0.203737532636542, NA, 0.205196077692786, 0.200992205838595),
CatCI = c("", "b", "c", "", "b", "c", "", "b", "c", "", "b",
"c", "", "b", "c")), .Names = c("Cat", "x", "mean", "max",
"min", "CI", "CatCI"), row.names = c(NA, 15L), class = "data.frame")

ggplot2 in R: geom_segment displays different line than geom_line

Say I have this data frame:
treatment <- c(rep("A",6),rep("B",6),rep("C",6),rep("D",6),rep("E",6),rep("F",6))
year <- as.numeric(c(1999:2004,1999:2004,2005:2010,2005:2010,2005:2010,2005:2010))
variable <- c(runif(6,4,5),runif(6,5,6),runif(6,3,4),runif(6,4,5),runif(6,5,6),runif(6,6,7))
se <- c(runif(6,0.2,0.5),runif(6,0.2,0.5),runif(6,0.2,0.5),runif(6,0.2,0.5),runif(6,0.2,0.5),runif(6,0.2,0.5))
id <- 1:36
library(data.table) # needed for as.data.table()
df1 <- as.data.table(cbind(id,treatment,year,variable,se))
df1$year <- as.numeric(df1$year)
df1$variable <- as.numeric(df1$variable)
df1$se <- as.numeric(df1$se)
As I mentioned in a previous question (draw two lines with the same origin using ggplot2 in R), I wanted to use ggplot2 to display my data in a specific way.
I managed to do so using the following script:
y1 <- df1[df1$treatment=='A'&df1$year==2004,]$variable
y2 <- df1[df1$treatment=='B'&df1$year==2004,]$variable
y3 <- df1[df1$treatment=='C'&df1$year==2005,]$variable
y4 <- df1[df1$treatment=='D'&df1$year==2005,]$variable
y5 <- df1[df1$treatment=='E'&df1$year==2005,]$variable
y6 <- df1[df1$treatment=='F'&df1$year==2005,]$variable
p <- ggplot(df1,aes(x=year,y=variable,group=treatment,color=treatment))+
geom_line(aes(y = variable, group = treatment, linetype = treatment, color = treatment),size=1.5,lineend = "round") +
scale_linetype_manual(values=c('solid','solid','solid','dashed','solid','dashed')) +
geom_point(aes(colour=factor(treatment)),size=4)+
geom_errorbar(aes(ymin=variable-se,ymax=variable+se),width=0.2,size=1.5)+
guides(colour = guide_legend(override.aes = list(shape=NA,linetype = c("solid", "solid",'solid','dashed','solid','dashed'))))
p+labs(title="Title", x="years", y = "Variable 1")+
theme_classic() +
scale_x_continuous(breaks=c(1998:2010), labels=c(1998:2010),limits=c(1998.5,2010.5))+
geom_segment(aes(x=2004, y=y1, xend=2005, yend=y3),colour='blue1',size=1.5,linetype='solid')+
geom_segment(aes(x=2004, y=y1, xend=2005, yend=y4),colour='blue1',size=1.5,linetype='dashed')+
geom_segment(aes(x=2004, y=y2, xend=2005, yend=y5),colour='red3',size=1.5,linetype='solid')+
geom_segment(aes(x=2004, y=y2, xend=2005, yend=y6),colour='red3',size=1.5,linetype='dashed')+
scale_color_manual(values=c('blue1','red3','blue1','blue1','red3','red3'))+
theme(text = element_text(size=12))
As you can see I used both geom_line and geom_segment to display the lines for my graph.
It's almost perfect but if you look closely, the segments that are drawn (between 2004 and 2005) do not display the same line size, even though I used the same arguments values in the script (i.e. size=1.5 and linetype='solid' or dashed).
Of course I could change manually the size of the segments to get similar lines, but when I do that, segments are not as smooth as the lines using geom_line.
Also, I get the same problem (different line shapes) by including the size or linetype arguments within the aes() argument.
Do you have any idea what causes this difference and how I can get the exact same shapes for both my segments and lines ?
It seems to be an anti-aliasing issue with geom_segment, but that seems like a somewhat cumbersome approach to begin with. I think I have resolved your issue by duplicating the A and B treatments in the original data frame.
# First we are going to duplicate and rename the 'shared' treatments
library(dplyr)
library(ggplot2)
df1 %>%
filter(treatment %in% c("A", "B")) %>%
mutate(treatment = ifelse(treatment == "A",
"AA", "BB")) %>%
bind_rows(df1) %>% # This rejoins with the original data
# Now we create `treatment_group` and `line_type` variables
mutate(treatment_group = ifelse(treatment %in% c("A", "C", "D", "AA"),
"treatment1",
"treatment2"), # This variable will denote color
line_type = ifelse(treatment %in% c("AA", "BB", "D", "F"),
"type1",
"type2")) %>% # And this variable denotes the line type
# Now pipe into ggplot
ggplot(aes(x = year, y = variable,
group = interaction(treatment_group, line_type), # grouping by both linetype and color
color = treatment_group)) +
geom_line(aes(x = year, y = variable, linetype = line_type),
size = 1.5, lineend = "round") +
geom_point(size=4) +
# The rest here is more or less the same as what you had
geom_errorbar(aes(ymin = variable-se, ymax = variable+se),
width = 0.2, size = 1.5) +
scale_color_manual(values=c('blue1','red3')) +
scale_linetype_manual(values = c('dashed', 'solid')) +
labs(title = "Title", x = "Years", y = "Variable 1") +
scale_x_continuous(breaks = c(1998:2010),
limits = c(1998.5, 2010.5))+
theme_classic() +
theme(text = element_text(size=12))
Which will give you the following
My numbers are different since they were randomly generated.
You can then modify the legend to your liking, but my recommendation is to use something like geom_text and be sure to set check_overlap = TRUE (geom_label does not support that argument).
Hope this helps!
