I am trying to add a linear regression model to my plot. I have this data frame:
           watershed        sqm        cfs
3 deerfieldwatershed 1718617392 22703.8851
5     greenwatershed  233458430  1637.4895
6     northwatershed  240348182  3281.9921
8     southwatershed   68031782   867.6428
and my current code is:
ggplot(dischargevsarea, aes(x = sqm, y = cfs, color = watershed)) +
geom_point(aes(color = watershed), size = 2) +
labs(y= "Discharge (cfs)", x = "Area (sq. m)", color = "Watershed") +
scale_color_manual(values = c("#BAC4C1", "#37B795",
"#00898F", "#002245"),
labels = c("Deerfield", "Green", "North",
"South")) +
theme_minimal() +
geom_smooth(method = "lm", se = FALSE)
When it runs, this adds a line to the points in the legend, but no line shows up on the graph (see image below). I suspect it is drawing a line individually for each point, but I want one regression line for all four points. How would I get the line I want to show up? Thanks.
You're right: your points are grouped into different categories (because of the color mapping in your first aes), so when you call geom_smooth, it fits a regression line for each category, which in your example means one for each single point. A line cannot be fitted through a single point, so nothing is drawn and you don't get a regression line.
To get a regression line for all points, you can pass the color argument only in the aes of geom_point (or you can use inherit.aes = FALSE in geom_smooth to indicate to ggplot to not consider previous mapping arguments and fill it with new arguments).
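As a rough sketch of the second option (this is not code from the original answer), geom_smooth can be given its own x and y mapping together with inherit.aes = FALSE, so the color grouping from ggplot() is ignored for the line. The data frame is rebuilt here from the values in the question, and the black line color is an arbitrary choice:

```r
library(ggplot2)

# Data from the question, rebuilt as a plain data.frame
dischargevsarea <- data.frame(
  watershed = c("deerfieldwatershed", "greenwatershed",
                "northwatershed", "southwatershed"),
  sqm = c(1718617392, 233458430, 240348182, 68031782),
  cfs = c(22703.8851, 1637.4895, 3281.9921, 867.6428)
)

p <- ggplot(dischargevsarea, aes(x = sqm, y = cfs, color = watershed)) +
  geom_point(size = 2) +
  # inherit.aes = FALSE drops the color mapping for this layer only,
  # so all four points fall into one group and get one fitted line
  geom_smooth(aes(x = sqm, y = cfs), method = "lm", formula = y ~ x,
              se = FALSE, inherit.aes = FALSE, color = "black")

# Number of groups the smoothing layer was computed for
n_groups <- length(unique(layer_data(p, 2)$group))
```

Printing p draws the plot; n_groups counts how many separate fits the smoothing layer produced.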
To display the equation on the graph (based on your question in the comments), you can use the stat_poly_eq function from the ggpmisc package (here is an SO post describing its use: Add regression line equation and R^2 on graph):
library(ggplot2)
library(ggpmisc)
ggplot(df, aes(x = sqm, y = cfs)) +
labs(y= "Discharge (cfs)", x = "Area (sq. m)", color = "Watershed") +
scale_color_manual(values = c("#BAC4C1", "#37B795",
"#00898F", "#002245"),
labels = c("Deerfield", "Green", "North",
"South")) +
theme_minimal() +
geom_smooth(method = "lm", se = FALSE, formula = y~x)+
stat_poly_eq(formula = y~x, aes(label = paste(..eq.label.., ..rr.label.., sep = "~~~")),
parse = TRUE)+
geom_point(aes(color = watershed))
Data
structure(list(watershed = c("deerfieldwatershed", "greenwatershed",
"northwatershed", "southwatershed"), sqm = c(1718617392L, 233458430L,
240348182L, 68031782L), cfs = c(22703.8851, 1637.4895, 3281.9921,
867.6428)), row.names = c(NA, -4L), class = "data.frame")
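For reference, the equation and R^2 that end up in the label can be cross-checked against a plain lm() fit. This is only a small sketch; it assumes the data above is stored as a plain data frame called df, as in the plotting code:

```r
# Same values as in the Data block, as a plain data.frame
df <- data.frame(
  watershed = c("deerfieldwatershed", "greenwatershed",
                "northwatershed", "southwatershed"),
  sqm = c(1718617392, 233458430, 240348182, 68031782),
  cfs = c(22703.8851, 1637.4895, 3281.9921, 867.6428)
)

# One regression over all four points, the same model as formula = y ~ x
fit <- lm(cfs ~ sqm, data = df)

coefs <- coef(fit)              # intercept and slope
r2 <- summary(fit)$r.squared    # coefficient of determination

print(coefs)
print(r2)
```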
Related
I have a data.frame with observed success/failure outcomes per two groups along with expected probabilities:
library(dplyr)
observed.probability.df <- data.frame(group = c("A","B"), p = c(0.4,0.6))
expected.probability.df <- data.frame(group = c("A","B"), p = qlogis(c(0.45,0.55)))
observed.data.df <- do.call(rbind,lapply(c("A","B"), function(g)
data.frame(group = g, value = c(rep(0,1000*dplyr::filter(observed.probability.df, group != g)$p),rep(1,1000*dplyr::filter(observed.probability.df, group == g)$p)))
)) %>% dplyr::left_join(expected.probability.df)
observed.probability.df$group <- factor(observed.probability.df$group, levels = c("A","B"))
observed.data.df$group <- factor(observed.data.df$group, levels = c("A","B"))
I'm fitting a logistic regression (binomial glm with a logit link function) to these data with the offset term:
fit <- glm(value ~ group + offset(p), data = observed.data.df, family = binomial(link = 'logit'))
Now, I'd like to plot these data as a bar graph using ggplot2's geom_bar, color-coded by group, and to add to that the trend line and shaded standard error area estimated in fit.
I'd use stat_smooth for that, but I don't think it can handle the offset term in its formula, so it looks like I need to assemble this figure in an alternative way.
To get the bars and the trend line I used:
slope.est <- function(x, ests) plogis(ests[1] + ests[2] * x)
library(ggplot2)
ggplot(observed.probability.df, aes(x = group, y = p, fill = group)) +
geom_bar(stat = 'identity') +
stat_function(fun = slope.est,args=list(ests=coef(fit)),size=2,color="black") +
scale_x_discrete(name = NULL,labels = levels(observed.probability.df$group), breaks = sort(unique(observed.probability.df$group))) +
theme_minimal() + theme(legend.title = element_blank()) + ylab("Fraction of cells")
So the question is how to add to that the shaded standard error around the trend line?
Using stat_function I am able to shade the entire area from the upper bound of the standard error all the way down to the X-axis:
ggplot(observed.probability.df, aes(x = group, y = p, fill = group)) +
geom_bar(stat = 'identity') +
stat_function(fun = slope.est,args=list(ests=coef(fit)),size=2,color="black") +
stat_function(fun = slope.est,args=list(ests=summary(fit)$coefficients[,1]+summary(fit)$coefficients[,2]),geom='area',fill="gray",alpha=0.25) +
scale_x_discrete(name = NULL,labels = levels(observed.probability.df$group), breaks = sort(unique(observed.probability.df$group))) +
theme_minimal() + theme(legend.title = element_blank()) + ylab("Fraction of cells")
Which is close but not quite there.
Any idea how to subtract from the shaded area above the area that's below the lower bound of the standard error? Perhaps geom_ribbon is the way to go here, but I don't know how to combine it with the slope.est function.
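One possible direction, offered only as a hedged sketch and not as an answer from this thread: let predict() compute the standard errors on the link scale (it takes the offset from newdata) and then back-transform the bounds with plogis(). Because the x axis is discrete, the sketch draws the interval with geom_errorbar instead of a ribbon. The data construction is a simplified stand-in for the code in the question, and the names obs, off, dat and band are made up for this example:

```r
library(ggplot2)

# Observed proportions and expected probabilities (logit scale) per group
obs <- data.frame(group = factor(c("A", "B")), p = c(0.4, 0.6))
off <- data.frame(group = factor(c("A", "B")), p = qlogis(c(0.45, 0.55)))

# 1000 binary outcomes per group matching the observed proportions
dat <- do.call(rbind, lapply(1:2, function(i) {
  n1 <- round(1000 * obs$p[i])
  data.frame(group = obs$group[i], value = rep(c(0, 1), times = c(1000 - n1, n1)))
}))
dat <- merge(dat, off, by = "group")

fit <- glm(value ~ group + offset(p), data = dat, family = binomial(link = "logit"))

# Predictions and standard errors on the link scale, offset included
pred <- predict(fit, newdata = off, type = "link", se.fit = TRUE)
band <- data.frame(
  group = off$group,
  fit   = plogis(pred$fit),
  lower = plogis(pred$fit - pred$se.fit),
  upper = plogis(pred$fit + pred$se.fit)
)

p <- ggplot(obs, aes(x = group, y = p, fill = group)) +
  geom_col() +
  geom_errorbar(data = band, aes(x = group, ymin = lower, ymax = upper),
                inherit.aes = FALSE, width = 0.2) +
  theme_minimal() + ylab("Fraction of cells")
```

Transforming fit ± se on the link scale keeps both bounds inside [0, 1], which adding the standard errors to the coefficients directly does not guarantee in general.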
I am developing a shiny app in which I generate scatterplots from uploaded data files in .txt format.
I am doing a polynomial fit on the scatterplot, and I want the plot to show the R^2 value.
Here is my attempt:
#plot
g <- ggplot(data = df, aes_string(x = df$x, y = df$y)) + theme_bw() +
geom_point(colour = "blue", size = 0.1)+
geom_smooth(formula = y ~ poly(x,input$degree, raw = TRUE), method = "lm", color = "green3", level = 1, size = 0.5)+
stat_poly_eq(formula = y ~ poly(x,input$degree, raw = TRUE),aes(label = paste(..eq.label.., ..rr.label.., sep = "~~~")), parse = TRUE)
ggplotly(g)
A slider is used to vary the degree of the polynomial function; its value is read as input$degree.
I used stat_poly_eq from the ggpmisc package to get the value of R^2.
But when I run this code, the R^2 value does not appear on the plot. According to the examples that I have seen, it should be shown on the plot as a legend.
I am really struggling to set the correct legend for a geom_point plot with a loess regression when two data sets are used.
I have a data set summarizing activity over a day. On the same graph, I plot all the activity recorded per hour and per day, plus a regression curve smoothed with a loess function, plus the mean of each hour across all the days.
To be more precise, here is the first version of the code and the graph it returns, without a legend, which is exactly what I expected:
# first graph: exactly what I expected, but with no legend
p <- ggplot(dat1, aes(x = Hour, y = value)) +
geom_point(color = "darkgray", size = 1) +
geom_point(data = dat2, mapping = aes(x = Hour, y = mean),
color = 20, size = 3) +
geom_smooth(method = "loess", span = 0.2, color = "red", fill = "blue")
and the graph (the grey points are all the data, per hour and per day; the red curve is the loess regression; the blue dots are the means for each hour):
When I tried to set the legend, I failed to get one that explains both kinds of dots (data in grey, mean in blue) and the loess curve (in red). Below are some examples of what I tried.
# second graph: what I expected, plus the legend I wanted for the loess curve,
# but without a legend for the dots
p <- ggplot(dat1, aes(x = Hour, y = value)) +
geom_point(color = "darkgray", size = 1) +
geom_point(data = dat2, mapping = aes(x = Hour, y = mean),
color = "blue", size = 3) +
geom_smooth(method = "loess", span = 0.2, aes(color = "red"), fill = "blue") +
scale_color_identity(name = "legend model", guide = "legend",
labels = "loess regression \n with confidence interval")
This gave the correct legend, but only for the curve.
Another attempt:
# I tried to combine both data sets into a single one as follows, but it did not
# work at all, and I really do not understand how legends work in ggplot2
# compared to base plots
A <- rbind(dat1, dat2)
p <- ggplot(A, aes(x = Heure, y = value, color = variable)) +
geom_point(data = subset(A, variable == "data"), size = 1) +
geom_point(data = subset(A, variable == "Moy"), size = 3) +
geom_smooth(method = "loess", span = 0.2, aes(color = "red"), fill = "blue") +
scale_color_manual(name = "légende",
labels = c("Data", "Moy", "loess regression \n with confidence interval"),
values = c("darkgray", "royalblue", "red"))
It appears that all the legend settings are mixed together in a "weird" way: there is a grey dot covered by a grey line, then the same in blue and in red (for the 3 labels), and all of them have a background filled in blue:
If you need to label the mean, you might need to be a bit creative, because it's not so easy to add a legend manually in ggplot.
Below I simulate something that looks like your data.
library(dplyr)  # needed for %>%, group_by() and summarise() below
dat1 = data.frame(
  Hour = rep(1:24,each=10),
  value = c(rnorm(60,0,1),rnorm(60,2,1),rnorm(60,1,1),rnorm(60,-1,1))
)
# classify this as raw data
dat1$Data = "Raw"
# calculate mean like you did
dat2 <- dat1 %>% group_by(Hour) %>% summarise(value=mean(value))
# classify this as mean
dat2$Data = "Mean"
# combine the data frames
plotdat <- rbind(dat1,dat2)
# add a dummy variable, we'll use it later
plotdat$line = "Loess-Smooth"
We make the basic dot plot first:
ggplot(plotdat, aes(x = Hour, y = value,col=Data,size=Data)) +
geom_point() +
scale_color_manual(values=c("blue","darkgray"))+
scale_size_manual(values=c(3,1),guide="none")
Note that for size we set guide to "none" so that it does not appear in the legend (older ggplot2 code used guide=FALSE, which is deprecated). Now we add the loess smooth. One way to give it a legend is to map a linetype; since there is only one group, the legend has just one entry:
ggplot(plotdat, aes(x = Hour, y = value,col=Data,size=Data)) +
geom_point() +
scale_color_manual(values=c("blue","darkgray"))+
scale_size_manual(values=c(3,1),guide="none")+
# subset() needs a logical test (==) to keep only the raw data for the fit
geom_smooth(data=subset(plotdat,Data=="Raw"),
aes(linetype=line),size=1,alpha=0.3,
method = "loess", span = 0.2, color = "red", fill = "blue")
I am having trouble adding linear regression lines to my ggplots.
This is how it should look:
This is how it currently looks:
This is my code:
p <- ggplot(data = wage, aes(x = educ, y = lwage, colour = black,
cex = IQ, pch = married, alpha = 0.7)) + geom_jitter()
p1 <- p + facet_grid(urban~experclass) + geom_smooth(se=F,method="lm")
p1 + labs(x = "Education (year)", y = "Log Wage", shape = "Marital status",
colour = "Ethnicity") + guides(alpha = FALSE)
Is the position of my geom_smooth wrong? What I want is only one black regression line for each element of the plot - and not one by layer.
Furthermore what happens when I add a regression line is that the legend symbols change. Especially the IQ legend looks pretty weird. Is there something I did not consider here?
I can try to answer at least one part of your question - which is the part about plotting one regression line instead of two per panel. I don't have your data so I can't fully replicate your problem, but I think this will work.
The aesthetics in your original ggplot() call will be inherited by all the subsequent layers, including the geom_smooth.
What you seem to want is the color aesthetic (which happens to be a grouping identifier) to apply only to the jittered points and not to the line. So you can write your code like this:
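p <- ggplot(data = wage, aes(x = educ, y = lwage,
                             cex = IQ, pch = married, alpha = 0.7)) +
  geom_jitter(aes(colour = black))   # colour is mapped for the points only
p1 <- p + facet_grid(urban ~ experclass) +
  geom_smooth(se = FALSE, method = "lm",
              colour = "black")      # constant line colour, set outside aes()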
or, alternatively, as one single ggplot call in a modified style:
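p3 <- ggplot(data = wage,
             aes(x = educ, y = lwage,
                 size = IQ, shape = married, alpha = 0.7)) +
  geom_jitter(aes(colour = black)) +   # colour is mapped for the points only
  geom_smooth(se = FALSE, method = "lm",
              colour = "black") +      # constant line colour, set outside aes()
  facet_grid(urban ~ experclass)
p3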
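Since the original wage data is not shown, here is a self-contained sketch of the same idea on simulated data. All values are made up, and the column types (for example married and black as factors) are assumptions. Every aesthetic that should only affect the points is mapped inside geom_jitter(), so geom_smooth() sees only x and y and fits one line per facet:

```r
library(ggplot2)

set.seed(1)
n <- 200
# Simulated stand-in for the wage data; column names follow the question
wage <- data.frame(
  educ       = sample(9:18, n, replace = TRUE),
  IQ         = round(rnorm(n, 100, 15)),
  married    = factor(sample(c("no", "yes"), n, replace = TRUE)),
  black      = factor(sample(c("no", "yes"), n, replace = TRUE)),
  urban      = factor(sample(c("rural", "urban"), n, replace = TRUE)),
  experclass = factor(sample(c("low", "high"), n, replace = TRUE))
)
wage$lwage <- 5.5 + 0.06 * wage$educ + rnorm(n, 0, 0.4)

p <- ggplot(wage, aes(x = educ, y = lwage)) +
  # point-only aesthetics live in this layer, so they do not create groups
  # (or legend keys) for the regression line
  geom_jitter(aes(colour = black, size = IQ, shape = married), alpha = 0.7) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, colour = "black") +
  facet_grid(urban ~ experclass)

# Count the fitted lines in each panel
smooth_data <- layer_data(p, 2)
lines_per_panel <- tapply(smooth_data$group, smooth_data$PANEL,
                          function(g) length(unique(g)))
```

Keeping size and shape out of the top-level aes() also stops the smoothing layer from adding line keys to those legends, which may be what made the IQ legend look odd.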
I'm using ggplot2 to show points colored by value. In addition, I want to show a regression line on this data.
This is an example of the data that I am using:
structure(list(a = c(63.635707116462, 59.7200565823145, 56.0311239027684,
53.1573088984712, 51.0192317467653, 48.0727441921859, 47.1516684444444,
45.5081981068289, 43.5874967485549, 43.3163255512322), b = c(278.983796321269,
254.833332215134, 234.812503036992, 221.519477352253, 212.013474843663,
199.926648466351, 194.577007436116, 186.506133515809, 179.411968705754,
172.056487287103), col = c(18.36245, 22.03494, 25.70743, 29.37992,
33.05241, 36.7249, 40.39739, 44.06988, 47.74237, 51.41486), predict = c(275.438415187452,
256.049214397717, 237.782656695549, 223.552332598712, 212.965175538386,
198.374997400175, 193.814089203754, 185.676086057123, 176.165312823424,
174.82254927815)), .Names = c("a", "b", "col", "predict"), row.names = c(NA,
-10L), class = "data.frame")
And the code I am using so far is as follows:
p <- ggplot(data = df, aes(x = a, y = b, colour=col)) + geom_point()
p + stat_smooth(method = "lm", formula = y ~ x, se = FALSE)
However, this does not produce a straight line (as it is smoothed) so instead I tried to follow one of the examples on ggplot2 (which is using qplot) and did the following:
model <- lm(b ~ a, data = df)
df$predict <- stats::predict(model, newdata=df)
p <- ggplot(data = df, aes(x = a, y = b, colour=col) ) + geom_point()
p + geom_line(aes(x = a, y = predict))
In the example, a line is added using + geom_line(data=grid), which in my case would be + geom_line(data=df). This just joins the points together, instead of drawing a straight line on the plot. How can I plot a line on this plot that is perfectly straight?
The other problem I was having with the plot, is renaming the legend. I want to have a two word title for the data (e.g. 'Z Density'), but I don't know how to change it. I've tried using + scale_colour_discrete(name = "Fancy Title") and + scale_linetype_discrete(name = "Fancy Title") using advice from this question but they do not work as my data is colored by a value.
As @Andrie says, using method = "lm" gives a linear model. As for your second question, use scale_color_continuous():
p <- ggplot(data = df, aes(x = a, y = b, colour=col)) + geom_point()
p + stat_smooth(method = "lm", se = FALSE) +
scale_colour_continuous(name = "My Legend")
You also don't need to do all of the predicting. ggplot() will do this for you, which is one of the great benefits.
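To check that the line drawn by stat_smooth(method = "lm") really is the ordinary least-squares line, it can be compared with predict() from lm(). This is a minimal sketch that uses rounded values from the question's data; the rounding and the names smooth and max_diff are mine:

```r
library(ggplot2)

# Rounded copy of the question's data
df <- data.frame(
  a   = c(63.64, 59.72, 56.03, 53.16, 51.02, 48.07, 47.15, 45.51, 43.59, 43.32),
  b   = c(278.98, 254.83, 234.81, 221.52, 212.01, 199.93, 194.58, 186.51, 179.41, 172.06),
  col = c(18.36, 22.03, 25.71, 29.38, 33.05, 36.72, 40.40, 44.07, 47.74, 51.41)
)

p <- ggplot(df, aes(x = a, y = b)) +
  geom_point(aes(colour = col)) +
  stat_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  scale_colour_continuous(name = "Z Density")

# Points along the line that ggplot computed for the smoothing layer
smooth <- layer_data(p, 2)

# The same model fitted by hand, evaluated at the same x positions
model <- lm(b ~ a, data = df)
max_diff <- max(abs(smooth$y - predict(model, newdata = data.frame(a = smooth$x))))
```

A max_diff that is zero up to floating-point error means the two lines coincide, so the manual predict() step from the question is not needed for plotting.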