I am trying to overlay two scatter plots. Here is the base code:
ggplot() + geom_point(data = df, aes(A, B, color = Cluster), shape=1) +
geom_point(data = as.data.frame(centers), aes(A, B), shape=13, size=7, alpha = 5)
This is what the plot looks like:
But when I attempt to add a color to the overlaid cluster centers (those circles with X inside):
ggplot() + geom_point(data = df, aes(A, B, color = Cluster), shape=1) +
geom_point(data = as.data.frame(centers), aes(A, B, color = "red"), shape=13, size=7, alpha = 5)
I get the following error: "Error: Discrete value supplied to continuous scale"
Here is a portion of the dataframe I am using to plot the first of two overlays:
> df
A B Cluster
1 1.33300195 -1.4524680585 2
2 1.41102294 -0.7889431279 2
3 1.36350553 -1.4437548005 2
4 1.61462300 -0.7145174514 2
5 -0.64722704 0.8449845639 1
6 1.33855918 -0.9161504530 2
7 1.33467865 -2.1513899524 2
8 1.50842550 -0.5170262065 2
9 1.67045671 -0.3644476090 2
10 1.32328373 -1.5496692059 2
My theory is that ggplot is interpreting the "Cluster" column of that dataframe as a continuous variable. Is there a way to change it so it's discrete? Should I instead use a column of colors as factors, for example 1 becomes "Blue" and 2 becomes "Black"?
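This should work. No data for centers was provided, so I could not add that layer to the plot. You are right that the continuous Cluster variable is what breaks the plot: wrap it in factor() and use scale_color_manual() to set the colors. Note also that a constant color such as "red" belongs outside aes(); inside aes() it is treated as a third level of the color scale, and scale_color_manual() with only two values then fails. Here is the code:
library(ggplot2)
#Code
ggplot() + geom_point(data = df, aes(A, B, color = factor(Cluster),
fill = factor(Cluster))) +
# constant color goes outside aes(); alpha must lie in [0, 1]
geom_point(data = as.data.frame(centers), aes(A, B), color = "red",
shape=13, size=7, alpha = 1)+
scale_color_manual(values=c('blue','black'))+labs(color='Cluster',fill='Cluster')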
Output:
Or keeping the original shape:
#Code 2
ggplot() + geom_point(data = df, aes(A, B, color = factor(Cluster)),shape=1) +
# constant color goes outside aes(); alpha must lie in [0, 1]
geom_point(data = as.data.frame(centers), aes(A, B), color = "red",
shape=13, size=7, alpha = 1)+
scale_color_manual(values=c('blue','black'))+labs(color='Cluster')
Output:
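Since the question did not include centers, here is a minimal self-contained sketch of the same idea. The small df and the centers data frame below are made up for illustration only; the point is that the cluster colors come from a discrete scale, while the fixed "red" of the centers is set as a constant outside aes().

```r
library(ggplot2)

# Hypothetical data: 'df' mimics the question, 'centers' is invented
df <- data.frame(
  A = c(1.33, 1.41, -0.65, -0.70),
  B = c(-1.45, -0.79, 0.84, 0.90),
  Cluster = c(2, 2, 1, 1)
)
centers <- data.frame(A = c(-0.675, 1.37), B = c(0.87, -1.12))

p <- ggplot() +
  # factor() makes the color scale discrete
  geom_point(data = df, aes(A, B, color = factor(Cluster)), shape = 1) +
  # a constant color is a layer parameter, not an aesthetic mapping
  geom_point(data = centers, aes(A, B), color = "red", shape = 13, size = 7) +
  scale_color_manual(values = c("1" = "blue", "2" = "black")) +
  labs(color = "Cluster")

# Inspect the colors ggplot actually assigned to each layer
cluster_colours <- unique(layer_data(p, 1)$colour)
center_colours <- unique(layer_data(p, 2)$colour)
```

Naming the values ("1" = ..., "2" = ...) ties each color to a cluster label instead of relying on the order of the factor levels.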
Usually in publications, statistically significant differences are shown by putting * above the bar. I have a lot of bars in my plot and I was hoping to make the significant ones stand out from the others by coloring them differently.
For example:
this is the dataset
some_data = data.frame(name = sample(LETTERS, 5),
value = rnorm(5, 5, 7),
pvalue = rnorm(5, 0.05, 0.02))
> some_data
name value pvalue
1 Q 8.8101784 0.01691628
2 Z 5.9426036 0.10228445
3 U 1.4862314 0.02062453
4 K -0.1365665 0.04405621
5 N 8.8828848 0.05992229
ggplot(some_data, aes(name, value)) +
geom_bar(stat = "identity") +
geom_text(aes(label=pvalue), position=position_dodge(width=0.9), vjust=-0.25)
What I want is to give the bars a different color if pvalue is less than 0.05.
Expressions inside aes() are evaluated as R code, so you can map fill directly to a logical condition:
ggplot(some_data, aes(x = name, y = value, fill = pvalue < 0.05)) +
geom_col() +
geom_text(aes(label=pvalue), position=position_dodge(width=0.9), vjust=-0.25)
EDIT: Use geom_col instead of geom_bar(stat = 'identity') per Axeman's comment.
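If you want control over which color means "significant", one option is to store the condition in its own column and pair it with scale_fill_manual(). This is only a sketch: the column name significant, the colors, and the rounded copy of the data are my own choices, not part of the original answer.

```r
library(ggplot2)

# Rounded copy of the example data from the question
some_data <- data.frame(
  name = c("Q", "Z", "U", "K", "N"),
  value = c(8.81, 5.94, 1.49, -0.14, 8.88),
  pvalue = c(0.0169, 0.1023, 0.0206, 0.0441, 0.0599),
  stringsAsFactors = FALSE
)

# Logical flag used for the fill mapping (hypothetical column name)
some_data$significant <- some_data$pvalue < 0.05

p <- ggplot(some_data, aes(name, value, fill = significant)) +
  geom_col() +
  # signif() keeps the labels short
  geom_text(aes(label = signif(pvalue, 2)), vjust = -0.25) +
  scale_fill_manual(
    values = c("TRUE" = "firebrick", "FALSE" = "grey60"),
    name = "p < 0.05"
  )

# Fill colors ggplot assigned to the bars
bar_fills <- layer_data(p, 1)$fill
```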
I am using ggplot geom_smooth to plot turnover data of a customer group from the previous year against the current year (based on calendar weeks). As the last week is not complete, I would like to use a dashed linetype for the last week. However, I can't figure out how to do that. I can either change the linetype for the entire plot or an entire series, but not within a series (depending on the value of x):
To keep it simple, let's just use the following example:
set.seed(42)
frame <- data.frame(series = rep(c('a','b'),50),x = 1:100, y = runif(100))
ggplot(frame,aes(x = x,y = y, group = series, color=series)) +
geom_smooth(size=1.5, se=FALSE)
How would I have to change this to get dashed lines for x >= 75?
The goal would be something like this:
Thx very much for any help!
Edit, 2016-03-05
Of course I fail when trying to use this method on the original plot. The problem lies with the ribbon, which is calculated using stat_summary and a predefined function. I tried to use stat_summary on the original data (mdf) and geom_line on the smooth_data. Even when I comment out everything else, I still get "Error: Continuous value supplied to discrete scale". I believe the problem comes from the fact that the original x value (Kalenderwoche) is discrete, whereas the new, smoothed x is continuous. Do I have to somehow transform one into the other? What else could I do?
Here is what I tried (condensed to the essential lines):
quartiles <- function(x) {
x <- na.omit(x) # remove NA values
median <- median(x)
q1 <- quantile(x,0.25)
q3 <- quantile(x,0.75)
data.frame(y = median, ymin = median, ymax = q3)
}
g <- ggplot(mdf, aes(x=Kalenderwoche, y=value, group=variable, colour=variable,fill=variable))+
geom_smooth(size=1.5, method="auto", se=FALSE)
# Take out the data for smooth line
smooth_data <- ggplot_build(g)$data[[1]]
ggplot(mdf, aes(x=Kalenderwoche, y=value, group=variable, colour=variable,fill=variable))+
stat_summary(fun.data = quartiles,geom="ribbon", colour="NA", alpha=0.25)+
geom_line(data=smooth_data, aes(x=x, y=y, group=group, colour=group, fill=group))
mdf looks like this:
str(mdf)
'data.frame': 280086 obs. of 5 variables:
$ konto_id : int 1 1 1 1 1 1 1 1 1 1 ...
$ Kalenderwoche: Factor w/ 14 levels "2015-48","2015-49",..: 4 12 1 3 7 13 10 6 5 9 ...
$ variable : Factor w/ 2 levels "Umsatz","Umsatz Vorjahr": 1 1 1 1 1 1 1 1 1 1 ...
$ value : num 0 428.3 97.8 76 793.1 ...
There are many accounts (konto_id), and for each account and calendar week (Kalenderwoche), there is a current turnover value (Umsatz) and a turnover value from last year (Umsatz Vorjahr). I can provide a smaller version of the data.frame and the entire code, if required.
Thx very much for any help!
P.S. I am a total novice in R, so my code probably looks rather stupid to pros, sorry for that :(
Edit, 2016-03-06
I have uploaded a subset of the data (mdf):
mdf
The full code of the original graph is the following (looking somewhat weird with so little data, but that's not the point ;)
library(dtw)
library(reshape2)
library(ggplot2)
library(RODBC)
library(Cairo)
# custom breaks for X axis
breaks.custom <- unique(mdf$Kalenderwoche)[c(TRUE,rep(FALSE,0))]
# function called by stat_summary
quartiles <- function(x) {
x <- na.omit(x)
median <- median(x)
q1 <- quantile(x,0.25)
q3 <- quantile(x,0.75)
data.frame(y = median, ymin = median, ymax = q3)
}
# Positions for guidelines and labels
horizontal.center <- (length(unique(mdf$Kalenderwoche))+1)/2
kw.horizontal.center <- as.vector(sort(unique(mdf$Kalenderwoche))[c(horizontal.center-0.5,horizontal.center+0.5)])
vpos.P75.label <- max(quantile(mdf$value[mdf$Kalenderwoche==kw.horizontal.center[1]],0.75)
,quantile(mdf$value[mdf$Kalenderwoche==kw.horizontal.center[2]],0.75))+10
# use the higher P75 value of the two weeks around the center
vpos.mean.label <- min(mean(mdf$value[mdf$Kalenderwoche==kw.horizontal.center[1]])
,mean(mdf$value[mdf$Kalenderwoche==kw.horizontal.center[2]]))-10
vpos.median.label <- min(median(mdf$value[mdf$Kalenderwoche==kw.horizontal.center[1]])
,median(mdf$value[mdf$Kalenderwoche==kw.horizontal.center[2]]))-10
hpos.vline <- which(as.vector(sort(unique(mdf$Kalenderwoche))=="2016-03"))
# custom colour palette (2 colors)
cbPaletteLine <- c("#DA2626", "#2626DA")
cbPaletteFill <- c("#F0A8A8", "#7C7CE9")
# ggplot
ggplot(mdf, aes(x=Kalenderwoche, y=value, group=variable, colour=variable,fill=variable))+
geom_smooth(size=1.5, method="auto", se=FALSE)+
# se=FALSE suppresses drawing the SE of the fit; the SE of the data shall be used instead:
stat_summary(fun.data = quartiles,geom="ribbon", colour="NA", alpha=0.25)+
scale_x_discrete(breaks=breaks.custom)+
scale_colour_manual(values=cbPaletteLine)+
scale_fill_manual(values=cbPaletteFill)+
#coord_cartesian(ylim = c(0, 250)) +
theme(legend.title = element_blank(), title = element_text(face="bold", size=12))+
#scale_color_brewer(palette="Dark2")+
labs(title = "Tranche 1", x = "Kalenderwoche", y = "Konto-Umsatz [CHF]")+
geom_vline(xintercept = hpos.vline, linetype=2)+
annotate("text", x=horizontal.center, y=vpos.median.label, label = "Median", size=4)+
annotate("text", x=horizontal.center, y=vpos.mean.label, label= "Mean", size=4)+
annotate("text", x=horizontal.center, y=vpos.P75.label, label = "P75%", size=4)+
theme(axis.text.x=element_text(angle = 90, hjust = 0.5, vjust = 0.5))
Edit, 2016-03-06
The final plot now looks like this (thx, Jason!!)
I am not sure how to smooth all the data and still use different line types for subsets with geom_smooth alone. My idea is to pull out the data that ggplot computed for the smooth line and redraw it with geom_line, once solid and once dashed. This is how I did it:
set.seed(42)
frame <- data.frame(series=rep(c('a','b'), 50),
x = 1:100, y = runif(100))
library(ggplot2)
g <- ggplot(frame, aes(x=x, y=y, color=series)) + geom_smooth(se=FALSE)
# Take out the data for smooth line
smooth_data <- ggplot_build(g)$data[[1]]
ggplot(smooth_data[smooth_data$x <= 76, ], aes(x=x, y=y, color=as.factor(group), group=group)) +
geom_line(size=1.5) +
geom_line(data=smooth_data[smooth_data$x >= 74, ], linetype="dashed", size=1.5) +
scale_color_discrete("Series", breaks=c("1", "2"), labels=c("a", "b"))
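As a variation, you can skip ggplot_build() and fit the smoother yourself. The sketch below assumes a default loess fit per series (which is what geom_smooth picks for small data sets); the prediction grid, the cutoff variable, and the helper smooth_one() are names I made up for illustration.

```r
library(ggplot2)

set.seed(42)
frame <- data.frame(series = rep(c("a", "b"), 50),
                    x = 1:100, y = runif(100),
                    stringsAsFactors = FALSE)

# Common prediction grid inside the x range of both series
grid <- 2:99

# Hypothetical helper: fit loess to one series and predict on the grid
smooth_one <- function(d) {
  fit <- loess(y ~ x, data = d)
  data.frame(series = d$series[1],
             x = grid,
             y = predict(fit, newdata = data.frame(x = grid)))
}
smoothed <- do.call(rbind, lapply(split(frame, frame$series), smooth_one))

# Split at the cutoff; the shared point x == cutoff avoids a gap
cutoff <- 75
solid <- smoothed[smoothed$x <= cutoff, ]
dashed <- smoothed[smoothed$x >= cutoff, ]

p <- ggplot(mapping = aes(x, y, color = series)) +
  geom_line(data = solid) +
  geom_line(data = dashed, linetype = "dashed")
```

Because the curve is fitted once on all the data and only split afterwards, the dashed part continues the solid part without a kink.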
You're right. The problem is that you add a layer with a continuous x to a plot whose original layer has a discrete x. One way to deal with it is a lookup table, which is easy in this case because the smoothed x is just the sequence 1 to 14, i.e. the positions of the factor levels. We can turn those positions back into the discrete values by indexing. Your code should work if you add:
level <- levels(mdf$Kalenderwoche)
ggplot(mdf, aes(x=Kalenderwoche, y=value, group=variable, colour=variable,fill=variable))+
stat_summary(fun.data = quartiles,geom="ribbon", colour="NA", alpha=0.25) +
geom_line(data=smooth_data, aes(x=level[x], y=y, group=group, colour=as.factor(group), fill=NA))
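The lookup itself is plain R indexing and can be checked without any plotting. A small sketch with made-up week labels (the vector weeks is hypothetical, not taken from mdf):

```r
# Hypothetical factor of calendar weeks
weeks <- factor(c("2015-50", "2015-48", "2015-49", "2015-48"))

# Lookup table: position -> label
level <- levels(weeks)

# Integer positions, as a discrete x appears in the built plot data
positions <- as.integer(weeks)

# Indexing the lookup table recovers the labels
labels_back <- level[positions]

# Rebuilding the factor with the same levels restores the original
restored <- factor(labels_back, levels = level)
```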
Here is my attempt for the question:
g <- ggplot(mdf, aes(x=Kalenderwoche, y=value, group=variable, colour=variable,fill=variable)) +
geom_smooth(size=1.5, method="auto", se=FALSE) +
# se=FALSE suppresses drawing the SE of the fit; the SE of the data shall be used instead:
stat_summary(fun.data = quartiles,geom="ribbon", colour="NA", alpha=0.25)
smooth_data <- ggplot_build(g)$data[[1]]
ribbon_data <- ggplot_build(g)$data[[2]]
# Use them as lookup table
level <- levels(mdf$Kalenderwoche)
clevel <- levels(mdf$variable)
ggplot(smooth_data[smooth_data$x <= 13, ], aes(x=level[x], y=y, group=group, color=as.factor(clevel[group]))) +
geom_line(size=1.5) +
geom_line(data=smooth_data[smooth_data$x >= 13, ], linetype="dashed", size=1.5) +
geom_ribbon(data=ribbon_data,
aes(x=x, ymin=ymin, ymax=ymax, fill=as.factor(clevel[group]), color=NA), alpha=0.25) +
scale_x_discrete(breaks=breaks.custom) +
scale_colour_manual(values=cbPaletteLine) +
scale_fill_manual(values=cbPaletteFill) +
#coord_cartesian(ylim = c(0, 250)) +
theme(legend.title = element_blank(), title = element_text(face="bold", size=12))+
#scale_color_brewer(palette="Dark2")+
labs(title = "Tranche 1", x = "Kalenderwoche", y = "Konto-Umsatz [CHF]")+
geom_vline(xintercept = hpos.vline, linetype=2)+
annotate("text", x=horizontal.center, y=vpos.median.label, label = "Median", size=4)+
annotate("text", x=horizontal.center, y=vpos.mean.label, label= "Mean", size=4)+
annotate("text", x=horizontal.center, y=vpos.P75.label, label = "P75%", size=4)+
theme(axis.text.x=element_text(angle = 90, hjust = 0.5, vjust = 0.5))
Note that the legend keys are drawn with a border in this version.
I would like to make a plot using facet_wrap where the axes can vary for each panel but within a panel the x and y axes should be the same scale.
e.g. see the following plots
df <- read.table(text = "
x y g
1 5 a
2 6 a
3 7 a
4 8 a
5 9 b
6 10 b
7 11 b
8 12 b", header = TRUE)
library(ggplot2)
ggplot(df, aes(x=x,y=y,g=g)) +
geom_point() +
facet_wrap(~ g) # all axes 1-12
ggplot(df, aes(x=x,y=y,g=g)) +
geom_point() +
facet_wrap(~ g, scales = "free")
# free axes, x & y axes don't match per panel
What I want is for panel a the x and y axes both to range from 1 - 8, and for panel b the x and y axes both to range from 5 - 12.
Is this possible?
Using this answer you could try the following:
dummy <- data.frame(x = c(1, 8, 5, 12), y = c(1, 8, 5, 12), g = c("a", "a", "b", "b"))
ggplot(df, aes(x=x,y=y)) +
geom_point() +
facet_wrap(~ g, scales = "free") +
geom_blank(data = dummy)
Another solution is to trick the axes of the individual facet_wrap() panels by adding invisible points with x and y swapped, so that the plotted data is "square", e.g.,
library(ggplot2)
p <- ggplot(data = df) +
geom_point(mapping = aes(x = x, y = y)) +
geom_point(mapping = aes(x = y, y = x), alpha = 0) +
facet_wrap( ~ g, scales = "free")
print(p)
You could also use geom_blank(). You don't need dummy data.
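One way to read this suggestion, sketched here as my own interpretation rather than the commenter's exact code: give geom_blank() the swapped mapping, so each panel's scales are trained on both x and y without drawing anything extra.

```r
library(ggplot2)

# Same data as in the question
df <- data.frame(x = 1:8, y = 5:12,
                 g = rep(c("a", "b"), each = 4),
                 stringsAsFactors = FALSE)

p <- ggplot(df, aes(x = x, y = y)) +
  geom_point() +
  # draws nothing, but extends each panel's scales to the swapped values
  geom_blank(aes(x = y, y = x)) +
  facet_wrap(~ g, scales = "free")

# Limits of the first (a) and second (b) panel
lims_a <- layer_scales(p, 1, 1)
lims_b <- layer_scales(p, 1, 2)
x_a <- as.numeric(lims_a$x$get_limits())
y_a <- as.numeric(lims_a$y$get_limits())
x_b <- as.numeric(lims_b$x$get_limits())
y_b <- as.numeric(lims_b$y$get_limits())
```

The limits are the data ranges before the default axis expansion is applied, so the drawn axes extend slightly beyond them.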
This wasn't an option when the question was asked, but these days I would highly recommend patchwork for combining plots.
I have a ggplot with a geom_text():
geom_text(y = 4, aes(label = text))
The variable text has the following format:
number1-number2
I want to know if it is possible to define one color for number1 and another color for number2 (for example, red and green).
Thanks!
One way is if you have for example the label texts of number1 and number2 as separate columns in the data frame:
ggplot(data, aes(x,y)) + geom_text(label=data[,3], color="red", vjust=0) + geom_text(label=data[,4], color="blue", vjust=1)
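If the two numbers only exist as one "number1-number2" string, you would first have to split it. A minimal sketch, assuming every label contains exactly one "-" and no negative numbers; the data frame and the column names number1 and number2 are invented for illustration:

```r
library(ggplot2)

# Hypothetical data with labels in the "number1-number2" format
data <- data.frame(x = 1:3, y = 4,
                   text = c("10-20", "3-7", "42-5"),
                   stringsAsFactors = FALSE)

# Split each label at the dash into a two-column character matrix
parts <- do.call(rbind, strsplit(data$text, "-", fixed = TRUE))
data$number1 <- parts[, 1]
data$number2 <- parts[, 2]

p <- ggplot(data, aes(x, y)) +
  # left part (with the dash) is right-aligned at x
  geom_text(aes(label = paste0(number1, "-")), color = "red", hjust = 1) +
  # right part is left-aligned at x, so the two pieces meet
  geom_text(aes(label = number2), color = "green", hjust = 0)
```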
You may also try annotate:
# data for plot
df <- data.frame(x = 1:5, y = 1:5)
# data for annotation
no1 <- "number1"
no2 <- "number2"
x_annot <- 4
y_annot <- 5
dodge <- 0.3
ggplot(data = df, aes(x = x, y = y)) +
geom_point() +
annotate(geom = "text", x = c(x_annot - dodge, x_annot, x_annot + dodge), y = y_annot,
label = c(no1, "-", no2),
col = c("red", "black", "green")) +
theme_classic()
I defined the labels and positions outside the annotate call, which possibly makes it easier to generate these variables more dynamically, e.g. if "number1" in fact could be calculated from the original data set, or positions be based on range of x and y.