Adding a line to a ggplot2 plot and tweaking legend - r

I'm using ggplot2 to show points colored by value. In addition, I want to show a regression line on this data.
This is an example of the data that I am using:
structure(list(a = c(63.635707116462, 59.7200565823145, 56.0311239027684,
53.1573088984712, 51.0192317467653, 48.0727441921859, 47.1516684444444,
45.5081981068289, 43.5874967485549, 43.3163255512322), b = c(278.983796321269,
254.833332215134, 234.812503036992, 221.519477352253, 212.013474843663,
199.926648466351, 194.577007436116, 186.506133515809, 179.411968705754,
172.056487287103), col = c(18.36245, 22.03494, 25.70743, 29.37992,
33.05241, 36.7249, 40.39739, 44.06988, 47.74237, 51.41486), predict = c(275.438415187452,
256.049214397717, 237.782656695549, 223.552332598712, 212.965175538386,
198.374997400175, 193.814089203754, 185.676086057123, 176.165312823424,
174.82254927815)), .Names = c("a", "b", "col", "predict"), row.names = c(NA,
-10L), class = "data.frame")
And the code I am using so far is as follows:
p <- ggplot(data = df, aes(x = a, y = b, colour=col)) + geom_point()
p + stat_smooth(method = "lm", formula = y ~ x, se = FALSE)
However, this does not produce a straight line (as it is smoothed) so instead I tried to follow one of the examples on ggplot2 (which is using qplot) and did the following:
model <- lm(b ~ a, data = df)
df$predict <- stats::predict(model, newdata=df)
p <- ggplot(data = df, aes(x = a, y = b, colour=col) ) + geom_point()
p + geom_line(aes(x = a, y = predict))
In the example, a line is added using + geom_line(data=grid), which in my case would be + geom_line(data=df). This just joins the points together, instead of drawing a straight line on the plot. How can I plot a line on this plot that is perfectly straight?
The other problem I was having with the plot is renaming the legend. I want a two-word title for the data (e.g. 'Z Density'), but I don't know how to change it. I've tried + scale_colour_discrete(name = "Fancy Title") and + scale_linetype_discrete(name = "Fancy Title"), following advice from this question, but they do not work because my data is colored by a continuous value.

As @Andrie says, using method = "lm" fits a linear model, so the line is straight. As for your second question, the colour is mapped to a continuous variable, so use scale_colour_continuous():
p <- ggplot(data = df, aes(x = a, y = b, colour=col)) + geom_point()
p + stat_smooth(method = "lm", se = FALSE) +
scale_colour_continuous(name = "My Legend")
You also don't need to do all of the predicting. ggplot() will do this for you, which is one of the great benefits.
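As a quick check that method = "lm" really gives a straight line, here is a minimal sketch. The data frame is a made-up stand-in rather than the asker's data, and the colour mapping is placed in geom_point() (an assumption of this sketch) so the smoother is not affected by it. It pulls the fitted line back out of the plot with layer_data() and compares the slopes between consecutive points.

```r
library(ggplot2)

# Hypothetical stand-in data, not the data frame from the question
set.seed(1)
df <- data.frame(a = seq(40, 65, length.out = 10))
df$b <- 4 * df$a + rnorm(10, sd = 5)
df$col <- seq(18, 52, length.out = 10)

p <- ggplot(df, aes(x = a, y = b)) +
  geom_point(aes(colour = col)) +
  stat_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  scale_colour_continuous(name = "Z Density")

# Layer 2 is the smoother; layer_data() returns the points it draws
fit_line <- layer_data(p, 2)

# For a straight line, every consecutive slope is the same
slopes <- diff(fit_line$y) / diff(fit_line$x)
max_slope_spread <- diff(range(slopes))
```

If max_slope_spread is zero up to floating-point error, the drawn line is straight. Printing p shows the legend under the title "Z Density".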

Related

Adding a regression trend line and a shaded standard error area to a ggplot for regression models that geom_smooth does not handle

I have a data.frame with observed success/failure outcomes per two groups along with expected probabilities:
library(dplyr)
observed.probability.df <- data.frame(group = c("A","B"), p = c(0.4,0.6))
expected.probability.df <- data.frame(group = c("A","B"), p = qlogis(c(0.45,0.55)))
observed.data.df <- do.call(rbind,lapply(c("A","B"), function(g)
data.frame(group = g, value = c(rep(0,1000*dplyr::filter(observed.probability.df, group != g)$p),rep(1,1000*dplyr::filter(observed.probability.df, group == g)$p)))
)) %>% dplyr::left_join(expected.probability.df)
observed.probability.df$group <- factor(observed.probability.df$group, levels = c("A","B"))
observed.data.df$group <- factor(observed.data.df$group, levels = c("A","B"))
I'm fitting a logistic regression (binomial glm with a logit link function) to these data with the offset term:
fit <- glm(value ~ group + offset(p), data = observed.data.df, family = binomial(link = 'logit'))
Now, I'd like to plot these data as a bar graph using ggplot2's geom_bar, color-coded by group, and to add to that the trend line and shaded standard error area estimated in fit.
I'd use stat_smooth for that, but I don't think it can handle the offset term in its formula, so it looks like I need to assemble this figure in an alternative way.
To get the bars and the trend line I used:
slope.est <- function(x, ests) plogis(ests[1] + ests[2] * x)
library(ggplot2)
ggplot(observed.probability.df, aes(x = group, y = p, fill = group)) +
geom_bar(stat = 'identity') +
stat_function(fun = slope.est,args=list(ests=coef(fit)),size=2,color="black") +
scale_x_discrete(name = NULL,labels = levels(observed.probability.df$group), breaks = sort(unique(observed.probability.df$group))) +
theme_minimal() + theme(legend.title = element_blank()) + ylab("Fraction of cells")
So the question is how to add to that the shaded standard error around the trend line?
Using stat_function I am able to shade the entire area from the upper bound of the standard error all the way down to the X-axis:
ggplot(observed.probability.df, aes(x = group, y = p, fill = group)) +
geom_bar(stat = 'identity') +
stat_function(fun = slope.est,args=list(ests=coef(fit)),size=2,color="black") +
stat_function(fun = slope.est,args=list(ests=summary(fit)$coefficients[,1]+summary(fit)$coefficients[,2]),geom='area',fill="gray",alpha=0.25) +
scale_x_discrete(name = NULL,labels = levels(observed.probability.df$group), breaks = sort(unique(observed.probability.df$group))) +
theme_minimal() + theme(legend.title = element_blank()) + ylab("Fraction of cells")
Which is close but not quite there.
Any idea how to subtract from the shaded area above the area that's below the lower bound of the standard error? Perhaps geom_ribbon is the way to go here, but I don't know how to combine it with the slope.est function.
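One possible direction with geom_ribbon, sketched under assumptions: the data, the numeric predictor x and the offset column off below are all hypothetical and simpler than the two-group setup in the question. The idea is to get the fit and its standard error on the link scale from predict(), add and subtract the standard error there, and back-transform both bounds with plogis() so the ribbon covers only the band between them.

```r
library(ggplot2)

set.seed(42)
# Hypothetical data: binary outcome, numeric predictor, known offset on the logit scale
d <- data.frame(x = runif(200, 0, 1))
d$off <- 0.2 * d$x
d$value <- rbinom(200, 1, plogis(-0.5 + 1.5 * d$x))

fit <- glm(value ~ x + offset(off), data = d, family = binomial(link = "logit"))

# Prediction grid; the offset column must be supplied for new data as well
grid <- data.frame(x = seq(0, 1, length.out = 50))
grid$off <- 0.2 * grid$x

# Fit and standard error on the link (logit) scale
pred <- predict(fit, newdata = grid, type = "link", se.fit = TRUE)

# Back-transform the fit and the +/- 1 SE bounds to probabilities
grid$p <- plogis(pred$fit)
grid$lower <- plogis(pred$fit - pred$se.fit)
grid$upper <- plogis(pred$fit + pred$se.fit)

p <- ggplot(grid, aes(x = x, y = p)) +
  geom_ribbon(aes(ymin = lower, ymax = upper), fill = "gray", alpha = 0.25) +
  geom_line(colour = "black")
```

Because the bounds are computed on the link scale and then transformed, they stay inside [0, 1], which adding and subtracting standard errors on the probability scale would not guarantee.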

R how to use geom_smooth with transformed Y axis and outside data set?

How do you use geom_smooth() using a formula with the following form:
log(Unit.Sales_1) ~ log(Price) + A
On top of a ggplot that is using a completely different dataset?
I'm currently able to use a transformed axis, a different dataset, but not both at the same time. I get the following error message:
Computation failed in stat_smooth(): object 'Unit.Sales_1' not found
And the rest of my ggplot looks like this:
ggplot() +
geom_point(data = hist_data1, aes(x = Price, y = Unit.Sales, color = "Historicals")) +
geom_line(data = est_data1, aes(x = P1, y = Q1, color = "Estimated")) +
geom_smooth(data = wide_oj_data,
formula = Unit.Sales_1 ~ log(Price_1) + log(Price_3) + Promotion_1 + Holiday,
method = "glm",
method.args = list(family = gaussian(link = 'log')),
aes(x = Price_1, y = Unit.Sales_1)
)
Thank you :)
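A note on the error: geom_smooth() evaluates its formula in terms of the x and y aesthetics, so names such as Unit.Sales_1 or extra covariates are not visible to it. A common workaround is to fit the model yourself and add the predictions as a geom_line() with its own data. The sketch below does that; all data frames, the coefficient values and the choice to hold A at 0 are hypothetical stand-ins, not the asker's data.

```r
library(ggplot2)

set.seed(7)
# Hypothetical stand-in for the historical data shown as points
hist_data <- data.frame(Price = runif(60, 1, 5))
hist_data$Unit.Sales <- exp(5 - 1.2 * log(hist_data$Price) + rnorm(60, sd = 0.1))

# Hypothetical stand-in for the separate data set used to fit the model
model_data <- data.frame(Price_1 = runif(80, 1, 5), A = rbinom(80, 1, 0.5))
model_data$Unit.Sales_1 <- with(
  model_data,
  exp(5 - 1.2 * log(Price_1) + 0.3 * A + rnorm(80, sd = 0.1))
)

# Fit outside ggplot2, using the formula shape from the question
fit <- lm(log(Unit.Sales_1) ~ log(Price_1) + A, data = model_data)

# Predict over a price grid with the extra covariate held fixed (assumption: A = 0)
grid <- data.frame(Price_1 = seq(1, 5, length.out = 50), A = 0)
grid$pred <- exp(predict(fit, newdata = grid))  # back-transform from the log scale

p <- ggplot() +
  geom_point(data = hist_data, aes(x = Price, y = Unit.Sales, colour = "Historicals")) +
  geom_line(data = grid, aes(x = Price_1, y = pred, colour = "Model fit"))
```

Each layer carries its own data argument, so the points and the fitted curve can come from completely different data frames.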

Overlapping Trend Lines in scatterplots, R

I am trying to overlay multiple trend lines using the geom_smooth() in R. I currently have this code.
ggplot(mtcars2, aes(x=Displacement, y = Variable, color = Variable)) +
  geom_point(aes(x=mpg, y = hp, col = "Power")) +
  geom_point(aes(x=mpg, y = drat, col = "Drag Coef."))
(mtcars2 is the normalized form of mtcars)
Which gives me this graph.
I am trying to use geom_smooth(method='lm') to draw two trend lines for the two variables. Any ideas?
(Bonus: I would also like to use the 'shape=1' parameter to differentiate the variables if possible. The following method does not work.)
geom_point(aes(x=mpg, y = hp, col = "Power", shape=2))
Update
I managed to do this.
ggplot(mtcars2, aes(x=Displacement, y = Variable, color = Variable)) +
  geom_point(aes(x=disp, y = hp, col = "Power")) +
  geom_point(aes(x=disp, y = mpg, col = "MPG")) +
  geom_smooth(method= 'lm',aes(x=disp, y = hp, col = "Power")) +
  geom_smooth(method= 'lm',aes(x=disp, y = mpg, col = "MPG"))
It looks like this.
But this is an ugly piece of code. If anybody can make this code look prettier, it'd be great. Also, I have not yet been able to implement the 'shape=2' parameter.
It seems like you're making your life harder than it needs to be...you can pass in additional parameters into aes() such as group and shape.
I don't know if I got your normalization right, but this should give you enough to get going in the right direction:
library(ggplot2)
library(reshape2)
#Do some normalization
mtcars$disp_norm <- with(mtcars, (disp - min(disp)) / (max(disp) - min(disp)))
mtcars$hp_norm <- with(mtcars, (hp - min(hp)) / (max(hp) - min(hp)))
mtcars$drat_norm <- with(mtcars, (drat - min(drat)) / (max(drat) - min(drat)))
#Melt into long form
mtcars.m <- melt(mtcars, id.vars = "disp_norm", measure.vars = c("hp_norm", "drat_norm"))
#plot
ggplot(mtcars.m, aes(disp_norm, value, group = variable, colour = variable, shape = variable)) +
geom_point() +
geom_smooth(method = "lm")
Yielding:

Most succinct way to label/annotate extreme values with ggplot?

I'd like to annotate all y-values greater than a y-threshold using ggplot2.
When you plot(lm(y~x)), using the base package, the second graph that pops up automatically is Residuals vs Fitted, the third is qqplot, and the fourth is Scale-location. Each of these automatically label your extreme Y values by listing their corresponding X value as an adjacent annotation. I'm looking for something like this.
What's the best way to achieve this base-default behavior using ggplot2?
Updated scale_size_area() in place of scale_area()
You might be able to take something from this to suit your needs.
library(ggplot2)
#Some data
df <- data.frame(x = round(runif(100), 2), y = round(runif(100), 2))
m1 <- lm(y ~ x, data = df)
df.fortified = fortify(m1)
names(df.fortified) # Names of the variables containing residuals and derived quantities
# Select extreme values
df.fortified$extreme = ifelse(abs(df.fortified$`.stdresid`) > 1.5, 1, 0)
# Based on examples on page 173 in Wickham's ggplot2 book
plot = ggplot(data = df.fortified, aes(x = x, y = .stdresid)) +
geom_point() +
geom_text(data = df.fortified[df.fortified$extreme == 1, ],
aes(label = x, x = x, y = .stdresid), size = 3, hjust = -.3)
plot
plot1 = ggplot(data = df.fortified, aes(x = .fitted, y = .resid)) +
  geom_point() + geom_smooth(se = FALSE)
plot2 = ggplot(data = df.fortified, aes(x = .fitted, y = .resid, size = .cooksd)) +
  geom_point() + scale_size_area("Cook's distance") + geom_smooth(se = FALSE, show.legend = FALSE)
library(gridExtra)
grid.arrange(plot1, plot2)
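The same idea can be written without fortify(), building the diagnostics data frame directly from the model with fitted() and rstandard(). This is only a sketch: the simulated data and the 1.5 cut-off are assumptions chosen for illustration.

```r
library(ggplot2)

set.seed(123)
# Hypothetical data
d <- data.frame(x = round(runif(100), 2))
d$y <- 2 * d$x + rnorm(100)
m <- lm(y ~ x, data = d)

# Diagnostics taken straight from the fitted model
diag_df <- data.frame(
  x = d$x,
  fitted = fitted(m),
  stdresid = rstandard(m)
)

# Rows whose standardized residual exceeds the (assumed) threshold
threshold <- 1.5
extremes <- diag_df[abs(diag_df$stdresid) > threshold, ]

# Plot all points, but label only the extreme ones with their x value
p <- ggplot(diag_df, aes(x = fitted, y = stdresid)) +
  geom_point() +
  geom_text(data = extremes, aes(label = x), size = 3, hjust = -0.3)
```

Passing the filtered data frame to geom_text() is what restricts the labels to the extreme points; the threshold can be changed to label more or fewer of them.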

Using ggplot2 how can I plot points with an aes() after plotting lines?

I'm using ggplot2 to show lines and points on a plot. What I am trying to do is to have the lines all the same color, and then to show the points colored by an attribute. My code is as follows:
# Data frame
dfDemo <- structure(list(Y = c(0.906231077471568, 0.569073561538186,
0.0783433165521566, 0.724580209473378, 0.359136092118470, 0.871301974471722,
0.400628333618918, 1.41778205350433, 0.932081770977729, 0.198188442350644
), X = c(0.208755495088456, 0.147750173706688, 0.0205864576474412,
0.162635017485883, 0.118877260137735, 0.186538613831806, 0.137831912094464,
0.293293029083812, 0.219247919537514, 0.0323148791663826), Z = c(11112951L,
11713300L, 14331476L, 11539301L, 12233602L, 15764099L, 10191778L,
12070774L, 11836422L, 15148685L)), .Names = c("Y", "X", "Z"
), row.names = c(NA, 10L), class = "data.frame")
# Variables
X = array(0.1,100)
Y = seq(length=100, from=0, by=0.01)
# make data frame
dfAll <- data.frame()
# make data frames using loop
for (x in c(1:10)){
# spacemate calc
Floors = array(x,100)
# include label
Label = paste(' ', toString(x), sep="")
df1 <- data.frame(X = X * x, Y = Y, Label)
# merge df1 to cumulative df, dfAll
dfAll <- rbind(dfAll, df1)
}
# plot
pl <- ggplot(dfAll, aes(x = X, y = Y, group = Label, colour = 'Measures')) + geom_line()
# add points to plot
pl + geom_point(data=dfDemo, aes(x = X, y = Y)) + opts(legend.position = "none")
This almost works, but I am unable to color the points by Z when I do this. I can plot the points separately, colored by Z using the following code:
ggplot(dfDemo, aes(x = X, y = Y, colour = Z)) + geom_point()
However, if I use the similar code after plotting the lines:
pl + geom_point(data=dfDemo, aes(x = X, y = Y, colour = Z)) + opts(legend.position = "none")
I get the following error:
Error: Continuous variable () supplied to discrete scale_hue.
I don't understand how to add the points to the chart so that I can colour them by a value. I appreciate any suggestion how to solve this.
The issue is that two colour scales collide: a discrete one from the ggplot() call (colour = 'Measures') and a continuous one from geom_point() (colour = Z). If you want the lines in one colour and the points coloured by value, remove the colour setting from the ggplot() call and put it inside geom_line(), outside the aes() call, so that it is set rather than mapped. Outside aes(), a string such as "red" is taken as a literal colour, not as a variable; wrapping it in I() is only needed with qplot(). In current versions of ggplot2, theme() takes the place of opts().
pl <- ggplot(dfAll, aes(x = X, y = Y, group = Label)) +
  geom_line(colour = "red")
pl + geom_point(data=dfDemo, aes(x = X, y = Y, colour = Z)) +
  theme(legend.position = "none")
HTH
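To see the difference between setting and mapping a colour, here is a small self-contained sketch. The data frames are made up, and inherit.aes = FALSE is an addition of this sketch so the point layer does not look for the lines' grouping variable. It inspects the colours each layer actually draws using layer_data().

```r
library(ggplot2)

# Hypothetical line data: three lines identified by Label
lines_df <- data.frame(
  X = rep(seq(0, 1, length.out = 20), times = 3),
  Label = rep(c("1", "2", "3"), each = 20)
)
lines_df$Y <- lines_df$X * as.numeric(lines_df$Label)

# Hypothetical point data with a continuous attribute Z
points_df <- data.frame(X = c(0.2, 0.5, 0.8), Y = c(0.3, 1.2, 1.0), Z = c(10, 20, 30))

p <- ggplot(lines_df, aes(x = X, y = Y, group = Label)) +
  geom_line(colour = "red") +                       # set: one fixed colour
  geom_point(data = points_df,
             aes(x = X, y = Y, colour = Z),         # mapped: colour follows Z
             inherit.aes = FALSE) +
  theme(legend.position = "none")

# Colours actually used by each layer
line_colours <- unique(layer_data(p, 1)$colour)
point_colours <- unique(layer_data(p, 2)$colour)
```

Here line_colours holds the single value "red", while point_colours holds one colour per distinct Z value, since only the point layer uses the continuous colour scale.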
