ggplot2 and regression lines and R^2 values - r

I know there have been a number of entries with regards to adding R^2 values to plots, but I am having trouble following the codes. I am graphing a scatter plot with three categories. I have added a linear regression line for each one. I would now like to add r^2 values for each but I can't figure out how to do this.
My code:
veg <- read.csv("latandwtall2.csv", header=TRUE)
library("ggplot2")
a <- ggplot(veg, aes(x=avglat, y=wtfi, color=genus)) + geom_point(shape=19, size=4)
b <- a + scale_colour_hue(l=50) + stat_smooth(method = "lm", formula = y ~ x, size = 1, se = FALSE)
c <- b + labs(x="Latitude", y="Weight (g)")
d <- c + theme_bw()
e <- d + theme(panel.grid.minor=element_blank(), panel.grid.major=element_blank())
#changes size of text
f <- e + theme(
axis.title.x = element_text(color="black", vjust=-0.35, size=15, face="bold"),
axis.title.y = element_text(color="black" , vjust=0.35, size=15, face="bold")
)
g <- f + theme(legend.key=element_rect(fill='white'))
g
Any help with how to add R^2 values would be greatly appreciated. Thanks!

If you build a data frame with the r-squared values, you might be able to (mostly) automate the positioning of the annotation text by including it as a call to geom_text.
Here's a toy example. The rsq data frame is used in geom_text to place the r-squared labels. In this case, I've set it up to put the labels just after the highest x-value and the predict function gets the y-value. It's probably too much work for a single plot, but if you're doing this a lot, you can turn it into a function so that you don't have to repeat the set-up code every time, and maybe add some fancier logic to make label placement more flexible:
library(ggplot2)
library(reshape2) # For melt function
# Fake data
set.seed(12)
x = runif(100, 0, 10)
dat = data.frame(x, y1 = 2*x + 3 + rnorm(100, 0, 5),
y2 = 4*x + 20 + rnorm(100, 0, 10))
dat.m = melt(dat, id.var="x")
# linear models
my1 = lm(y1 ~ x, data=dat)
my2 = lm(y2 ~ x, data=dat)
# Data frame for adding r-squared values to plot
rsq = data.frame(model=c("y1","y2"),
r2=c(summary(my1)$adj.r.squared,
summary(my2)$adj.r.squared),
x=max(dat$x),
y=c(predict(my1, newdata=data.frame(x=max(dat$x))),
predict(my2, newdata=data.frame(x=max(dat$x)))))
ggplot() +
geom_point(data=dat.m, aes(x, value, colour=variable)) +
geom_smooth(data=dat.m, aes(x, value, colour=variable),
method="lm", se=FALSE) +
geom_text(data=rsq, aes(label=paste("r^2 == ", round(r2,2)),
x=1.05*x, y=y, colour=model, hjust=0.5),
size=4.5, parse=TRUE)
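The "turn it into a function" idea above could be sketched roughly as follows. This is only an illustration under stated assumptions: `rsq_by_group()` is a hypothetical helper (not part of ggplot2), `veg_demo` is made-up data that reuses the question's column names (`avglat`, `wtfi`, `genus`), and it reports the plain rather than the adjusted R^2.

```r
# Hypothetical helper: fits one lm() per group and returns a data frame
# with R^2 and a label position at the right-hand end of each fitted line.
rsq_by_group <- function(data, x, y, group) {
  pieces <- split(data, data[[group]], drop = TRUE)
  rows <- lapply(names(pieces), function(g) {
    d <- pieces[[g]]
    fit <- lm(reformulate(x, response = y), data = d)
    x_max <- max(d[[x]])
    out <- data.frame(
      r2 = summary(fit)$r.squared,
      label_x = x_max,
      label_y = unname(predict(fit, newdata = setNames(data.frame(x_max), x)))
    )
    out[[group]] <- g  # keep the grouping column so colours match the plot
    out
  })
  do.call(rbind, rows)
}

# Made-up data with the same column names as in the question
set.seed(1)
veg_demo <- data.frame(
  genus = rep(c("A", "B", "C"), each = 20),
  avglat = runif(60, 30, 50)
)
veg_demo$wtfi <- 2 + 0.5 * veg_demo$avglat + rnorm(60)

rsq <- rsq_by_group(veg_demo, "avglat", "wtfi", "genus")
rsq

# Plot only if ggplot2 is installed
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(veg_demo, aes(avglat, wtfi, colour = genus)) +
    geom_point() +
    geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
    geom_text(data = rsq,
              aes(x = label_x, y = label_y,
                  label = paste("R^2 ==", round(r2, 2))),
              parse = TRUE, hjust = -0.1, show.legend = FALSE)
  print(p)
}
```

With the real data, something like rsq_by_group(veg, "avglat", "wtfi", "genus") should give one row per genus, and the label placement (hjust, or the x position) will probably need tweaking so the text stays inside the panel.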

I can't really reproduce what you're doing, but you need to use annotate().
Something that could work (putting the R2 label at the 10th point) would be:
R2 = 0.4
i = 10
text = paste("R-squared = ", R2, sep="")
# annotate() takes the font through family=; swap in any family your graphics device knows (e.g. "Calibri")
g = g + annotate("text", x=veg$avglat[i], y=veg$wtfi[i], label=text, family="sans", colour="red", vjust = -2, hjust = 1)
Use vjust and hjust to adjust the position of the text relative to the point (change i to pick another point), and fill the variable R2 with your computed R-squared. You can choose the point you like or enter the x,y coordinates manually; it's up to you. Does that help?
PS. I put extra parameters (font, colours) so that you have the flexibility to change them.

Build the model separately, get the R^2 from there, and add it to the plot. I'll give you some dummy code, but it would be of better quality if you had given us a sample data frame.
r2 = summary(lm(wtfi ~ avglat, data=veg))$r.squared
#to piggyback on Romain's code...
i=10
g = g + annotate("text", x=veg$avglat[i], y=veg$wtfi[i], label=round(r2,2), family="sans", colour="red", vjust = -2, hjust = 1)
The way I wrote it here you don't need to hard-code the R^2 value in.

Related

Creating a legend with shapes using ggplot2

I have created the following code for a graph in which four fitted lines and corresponding points are plotted. I have problems with the legend. For some reason I cannot find a way to assign the different shapes of the points to a variable name. Also, the colours do not line up with the actual colours in the graph.
y1 <- c(1400,1200,1100,1000,900,800)
y2 <- c(1300,1130,1020,970,830,820)
y3 <- c(1340,1230,1120,1070,940,850)
y4 <- c(1290,1150,1040,920,810,800)
df <- data.frame(x,y1,y2,y3,y4)
g <- ggplot(df, aes(x=x), shape="shape") +
geom_smooth(aes(y=y1), colour="red", method="auto", se=FALSE) + geom_point(aes(y=y1),shape=14) +
geom_smooth(aes(y=y2), colour="blue", method="auto", se=FALSE) + geom_point(aes(y=y2),shape=8) +
geom_smooth(aes(y=y3), colour="green", method="auto", se=FALSE) + geom_point(aes(y=y3),shape=6) +
geom_smooth(aes(y=y4), colour="yellow", method="auto", se=FALSE) + geom_point(aes(y=y4),shape=2) +
ylab("x") + xlab("y") + labs(title="overview")
geom_line(aes(y=1000), linetype = "dashed")
theme_light() +
theme(plot.title = element_text(color="black", size=12, face="italic", hjust = 0.5)) +
scale_shape_binned(name="Value g", values=c(y1="14",y2="8",y3="6",y4="2"))
print(g)
I am wondering why the colours don't match up and how I can construct such a legend that it is clear which shape corresponds to which variable name.
While you can add the legend manually via scale_shape_manual, perhaps the adequate solution would be to reshape your data (try using tidyr::pivot_longer() on y1:y4 variables), and then assigning the resulting variable to the shape aesthetic (you can then manually set the colors to your liking). You would then need to use a single geom_point() and geom_smooth() instead of four of each.
Also, you're missing a reproducible example (what are the values of x?), and your code emits some warnings while trying to perform loess smoothing (because there are fewer data points than are needed to perform it).
Update (2021-12-12)
Here's a reproducible example in which we reshape the original data and feed it to ggplot using its aes() function to automatically plot different geom_point and geom_smooth for each "y group". I made up the values for the x variable.
library(ggplot2)
library(tidyr)
x <- 1:6
y1 <- c(1400,1200,1100,1000,900,800)
y2 <- c(1300,1130,1020,970,830,820)
y3 <- c(1340,1230,1120,1070,940,850)
y4 <- c(1290,1150,1040,920,810,800)
df <- data.frame(x,y1,y2,y3,y4)
data2 <- df %>%
pivot_longer(y1:y4, names_to = "group", values_to = "y")
ggplot(data2, aes(x, y, color = group, shape = group)) +
geom_point(size = 3) + # increased size for increased visibility
geom_smooth(method = "auto", se = FALSE)
Run the code line by line in RStudio and inspect data2; I think it'll make more sense that way.
Another update
Freek19, in your second example you'll need to specify both the shape and color scales manually, so that ggplot2 considers them to be the same, like so:
library(ggplot2)
data <- ... # from your previous example
ggplot(data, aes(x, y, shape = group, color = group)) +
geom_smooth() +
geom_point(size = 3) +
scale_shape_manual("Program type", values=c(1, 2, 3,4,5)) +
scale_color_manual("Program type", values=c(1, 2, 3,4,5))
Hope this helps.
I managed to get close to what I want, using:
library(ggplot2)
data <- data.frame(x = c(0,0.02,0.04,0.06,0.08,0.1),
y = c(1400,1200,1100,1000,910,850, #y1
1300,1130,1010,970,890,840, #y2
1200,1080,980,950,880,820, #y3
1100,1050,960,930,830,810, #y4
1050,1000,950,920,810,800), #y5
group = rep(c("5%","6%","7%","8%","9%"), each = 6))
data
Values <- ggplot(data, aes(x, y, shape = group, color = group)) + # Create line plot with default colors
geom_smooth(aes(color=group)) + geom_point(aes(shape=group),size=3) +
scale_shape_manual(values=c(1, 2, 3,4,5))+
geom_line(aes(y=1000), linetype = "dashed") +
ylab("V(c)") + xlab("c") + labs(title="Valuation")+
theme_light() +
theme(plot.title = element_text(color="black", size=12, face="italic", hjust = 0.5))+
labs(group="Program Type")
Values
I am only stuck with 2 legends. I want to change both names, because otherwise they overlap. However, I am not sure how to do this.
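One way to end up with a single legend is to give the colour and shape scales the same title, since ggplot2 merges legends whose titles and labels match. The sketch below uses a cut-down, made-up version of the data (two groups instead of five) and a linear fit to keep it small; the title "Program type" is just a placeholder.

```r
library(ggplot2)

# Cut-down demo data: two of the five groups from the example above
demo <- data.frame(
  x = rep(c(0, 0.02, 0.04, 0.06, 0.08, 0.1), times = 2),
  y = c(1400, 1200, 1100, 1000, 910, 850,
        1300, 1130, 1010, 970, 890, 840),
  group = rep(c("5%", "6%"), each = 6)
)

p <- ggplot(demo, aes(x, y, colour = group, shape = group)) +
  geom_point(size = 3) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  # Same title for both aesthetics, so the two legends collapse into one
  labs(colour = "Program type", shape = "Program type")
print(p)
```

Note that labs(group = ...) has no visible effect here, because the group aesthetic has no legend of its own; the titles have to be set for colour and shape.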

R: point at which geom_smooth drops below a certain value

Hi stack overflow community,
I hope the two interrelated questions I am asking are not too nooby. I tried several google searches but could not find a solution.
I use R to plot the findings of a linguistic "experiment", in which I checked to what extent two grammatical constructions yield acceptable descriptions of an event, depending on how far it has unfolded. My data look similar to this:
event,PFV.alone,PFV.and.PART
0.01,0,1
0.01,0,1
0.05,0,1
0.05,0,1
0.05,0,1
0.1,0,1
0.1,0,1
0.25,0,1
0.25,0,1
0.25,0,1
0.3,0,1
0.3,0,1
0.33,0,1
0.33,0,1
0.33,0,1
0.33,0,1
....
0.67,1,0.5
0.75,1,0.5
0.75,1,0
0.75,1,0
0.75,1,0
0.8,1,1
0.8,1,0
0.8,1,0
0.8,1,0
0.85,1,1
0.85,1,0
0.9,1,0
0.9,1,0
0.9,1,0
0.95,1,0
As you can see, for each of the two constructions there are "plateaus" where acceptability is 0 or 1 and then there's a "transitional" area. In order to illustrate the "plateaus" I use geom_segment and to create a smooth "transition" for the scattered data in between, I use geom_smooth. Here's my code:
#after loading datafile into "Daten":
p <- ggplot(data = Daten,
aes(x=event, y=PFV.and.PART, xmin=0, ymin=0, xmax=1, ymax=1))
p + geom_blank() +
coord_fixed()+
xlab("Progress of the event") +
ylab("Acceptability") +
geom_segment(x=0, xend=1, y=0.5,yend=0.5, linetype="dotted") +
geom_smooth(data=(subset(Daten, event==0.33 | event ==0.9)),
aes(color="chocolate"),
method="loess", fullrange=FALSE, level=0.95, se=FALSE) +
geom_segment(x=0,xend=0.33,y=1,yend=1, color="chocolate", size=1) +
geom_segment(x=0.9,xend=1,y=0,yend=0, color="chocolate", size=1) +
geom_smooth(data=(subset(Daten, event==0.33 | event==0.67)),
aes(x = event, y = PFV.alone, color="cyan4"),
method="lm",fullrange=FALSE, level=0.95, se=FALSE) +
geom_segment(color="cyan4",x=0,xend=0.33,y=0,yend=0,size=1) +
geom_segment(color="cyan4", x=0.67,xend=1,y=1,yend=1, size=1) +
scale_x_continuous(labels = scales::percent) +
scale_y_continuous(breaks = c (0,0.5,1), labels = scales::percent)+
labs(color='Construction')+
scale_color_manual(labels = c("PFV + PART", "PFV alone"),
values = c("chocolate", "cyan4")) +
theme(legend.position=c(0.05, 0.8),
legend.justification = c("left", "top"),
legend.background = element_rect(fill = "darkgray"))
This code produces a nice graph, but there's one calculation and one plot-related issue that I need help with.
First, and most importantly, I'd like to find out at what point exactly the geom_smooth (loess) curve for "PFV.and.PART" drops down to 0.5, i.e. hits 50% acceptability. I fear that this might involve some quite complex code?
Related to the preceding point, I'd like to mark the area/line where both curves are above 0.5 (50% acceptability), or to speak in terms of what I am trying to show: the percentages of the event at which both constructions yield a description that is at least 50% acceptable. This, of course, would be based on point 1, as it is necessary to determine the right limit, whereas the left limit does not constitute a problem as it seems to lie at x=0.5,y=0.5.
I'd really appreciate any help and I hope that I have provided all the necessary information. Please excuse me if this question has been addressed elsewhere.
Here's one approach, which involves fitting a loess model outside of ggplot
library(tidyverse) # provides tibble(), add_column(), %>% and ggplot()
# Generate some data
set.seed(2019)
my_dat <- c(sample(c(1,0.5, 0),33, prob = c(0.85,0.15,0), replace = TRUE),
sample(c(1,0.5, 0),33, prob = c(0.1,0.7,0.1), replace = TRUE),
sample(c(1,0.5,0),34, prob = c(0,0.15,0.85), replace = TRUE))
df <- tibble(x = 1:100, y = my_dat)
# fit a loess model
m1 <- loess(y~x, data = df)
df <- df %>%
add_column(pred = predict(m1)) # predict using the loess model
# plot
df %>%
ggplot(aes(x,y))+
geom_point() +
geom_line(aes(y = pred))
# search for a value of x that gives a prediction of 0.5
f <- function(x) { 0.5 - predict(m1)[x]}
uniroot(f, interval = c(1, 100))
# $root
# [1] 53.99997
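A variant of the same idea, as a rough sketch: instead of indexing the vector of fitted values, the root finder can query the loess model at arbitrary x through predict(..., newdata = ...), so the crossing point is interpolated between observations. The data here are made up (a noisy drop from 1 to 0 around x = 0.6), and the names demo, fit and crossing are placeholders.

```r
# Made-up data: acceptability falls from about 1 to about 0 around x = 0.6
set.seed(42)
demo <- data.frame(x = seq(0, 1, length.out = 101))
demo$y <- plogis((0.6 - demo$x) * 15) + rnorm(101, sd = 0.05)

# Fit the loess model outside ggplot2
fit <- loess(y ~ x, data = demo)

# Signed distance of the smoothed curve from the 0.5 level at any x
crossing <- function(x) predict(fit, newdata = data.frame(x = x)) - 0.5

# uniroot() needs an interval over which the function changes sign
root <- uniroot(crossing, interval = range(demo$x))$root
root
```

For the data in the question, the model would presumably be loess(PFV.and.PART ~ event, data = ...) on the same subset that is passed to geom_smooth. The resulting root could then serve as the right-hand limit of the region where both constructions are at least 50% acceptable, for example as xmax in annotate("rect", ...).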

annotate r squared to ggplot by using facet_wrap

I just joined the community and am looking forward to getting some help with the data analysis for my master's thesis.
At the moment I have the following problem:
I plotted 42 varieties with ggplot by using facet_wrap:
ggplot(sumfvvar,aes(x=TemperaturCmean,y=Fv.Fm,col=treatment))+
geom_point(shape=1,size=1)+
geom_smooth(method=lm)+
scale_color_brewer(palette = "Set1")+
facet_wrap(.~Variety)
That works very well, but I would like to annotate the r squared values for the regression lines. I have two treatments and 42 varieties, therefore 84 regression lines.
Is there any way to calculate all r squared values and integrate them into the ggplot? I already found the function
ggplotRegression <- function (fit) {
require(ggplot2)
ggplot(fit$model, aes_string(x = names(fit$model)[2], y = names(fit$model)[1])) +
geom_point() +
stat_smooth(method = "lm") +
labs(title = paste("Adj R2 = ",signif(summary(fit)$adj.r.squared, 5),
"Intercept =",signif(fit$coef[[1]],5 ),
" Slope =",signif(fit$coef[[2]], 5),
" P =",signif(summary(fit)$coef[2,4], 5)))
}
but that works for just one variety and one treatment. Could a loop over the lm() function be an option?
Here is an example with the ggpmisc package:
library(ggpmisc)
set.seed(4321)
x <- 1:100
y <- (x + x^2 + x^3) + rnorm(length(x), mean = 0, sd = mean(x^3) / 4)
my.data <- data.frame(x = x,
y = y,
group = c("A", "B"))
formula <- y ~ poly(x, 1, raw = TRUE)
ggplot(my.data, aes(x, y)) +
facet_wrap(~ group) +
geom_point() +
geom_smooth(method = "lm", formula = formula) +
stat_poly_eq(formula = formula, parse = TRUE,
mapping = aes(label = after_stat(rr.label)))
You can't apply different labels to different facets unless you add another r^2 column to your data. One way is to use geom_text, but you need to calculate the stats you need first. Below I show an example with iris; for your case, just swap Species for Variety, and so on.
library(tidyverse)
# simulate data for 2 treatments
# d2 is just shifted up from d1
d1 <- data.frame(iris,Treatment="A")
d2 <- data.frame(iris,Treatment="B") %>%
mutate(Sepal.Length=Sepal.Length+rnorm(nrow(iris),1,0.5))
# combine datasets
DF <- rbind(d1,d2) %>% rename(Variety = Species)
# plot like you did
# note I use "free" scales, if scales very different between Species
# your facet plots will be squished
g <- ggplot(DF,aes(x=Sepal.Width,y=Sepal.Length,col=Treatment))+
geom_point(shape=1,size=1)+
geom_smooth(method=lm)+
scale_color_brewer(palette = "Set1")+
facet_wrap(.~Variety,scales="free")
# rsq function
RSQ = function(y,x){signif(summary(lm(y ~ x))$adj.r.squared, 3)}
#calculate rsq for variety + treatment
STATS <- DF %>%
group_by(Variety,Treatment) %>%
summarise(Rsq=RSQ(Sepal.Length,Sepal.Width)) %>%
# make a label
# one other option is to use stringr::str_wrap in geom_text
mutate(Label=paste("Treat",Treatment,", Rsq=",Rsq))
# set vertical position of rsq
VJUST = ifelse(STATS$Treatment=="A",1.5,3)
# finally the plot function
g + geom_text(data=STATS,aes(x=-Inf,y=+Inf,label=Label),
hjust = -0.1, vjust = VJUST,size=3)
For the last geom_text() call, I let the vertical position of the labels differ by treatment (the VJUST vector). You might need to adjust that depending on your plot.

Error with ggplot2

I don't know what I am missing in the code.
set.seed(12345)
require(ggplot2)
AData <- data.frame(Glabel=LETTERS[1:7], A=rnorm(7, mean = 0, sd = 1), B=rnorm(7, mean = 0, sd = 1))
TData <- data.frame(Tlabel=LETTERS[11:20], A=rnorm(10, mean = 0, sd = 1), B=rnorm(10, mean = 0, sd = 1))
i <- 2
j <- 3
p <- ggplot(data=AData, aes(AData[, i], AData[, j])) + geom_point() + theme_bw()
p <- p + geom_text(aes(data=AData, label=Glabel), size=3, vjust=1.25, colour="black")
p <- p + geom_segment(data = TData, aes(xend = TData[ ,i], yend=TData[ ,j]),
x=0, y=0, colour="black",
arrow=arrow(angle=25, length=unit(0.25, "cm")))
p <- p + geom_text(data=TData, aes(label=Tlabel), size=3, vjust=1.35, colour="black")
The last line of the code produces the error. Please point out how I can fix this problem. Thanks in advance.
I have no idea what you are trying to do, but the line that fails is the last line, because you haven't mapped new x and y variables in the mapping. geom_text() needs x and y coords but you only provide the label argument, so ggplot takes x and y from p, which has only 7 rows of data whilst Tlabel is of length 10. That explains the error. I presume you mean to plot at x = A and y = B of TData? If so, this works:
p + geom_text(data=TData, mapping = aes(A, B, label=Tlabel),
size=3, vjust=1.35, colour="black")
(This might get a better answer on the ggplot mailing list.)
It looks like you're trying to display some kind of biplot ... the root of your problem is that you're violating the idiom of ggplot, which wants you to specify variables in a way that's consistent with the scope of the data.
Maybe this does what you want, via some aes_string trickery that substitutes the names of the desired columns ...
varnames <- colnames(AData)[-1]
v1 <- varnames[1]
v2 <- varnames[2]
p <- ggplot(data=AData,
aes_string(x=v1, y=v2)) + geom_point() + theme_bw()
## took out redundant 'data', made size bigger so I could see the labels
p <- p + geom_text(aes(label=Glabel), size=7, vjust=1.25, colour="black")
p <- p + geom_segment(data = TData, aes_string(xend = v1, yend=v2),
x=0, y=0, colour="black",
arrow=arrow(angle=25, length=unit(0.25, "cm")))
## added colour so I could distinguish this second set of labels
p <- p + geom_text(data=TData,
aes(label=Tlabel), size=10, vjust=1.35, colour="blue")
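aes_string() has since been deprecated in ggplot2. A rough equivalent with the .data pronoun is sketched below, assuming a reasonably recent ggplot2 (3.x or later); it reuses AData and TData from the question, and the explicit x and y mappings in the last layer are a choice made here to keep each layer self-explanatory.

```r
library(ggplot2)

set.seed(12345)
AData <- data.frame(Glabel = LETTERS[1:7], A = rnorm(7), B = rnorm(7))
TData <- data.frame(Tlabel = LETTERS[11:20], A = rnorm(10), B = rnorm(10))

# Column names held as strings, as in the aes_string() version
v1 <- "A"
v2 <- "B"

p <- ggplot(AData, aes(x = .data[[v1]], y = .data[[v2]])) +
  geom_point() +
  geom_text(aes(label = Glabel), vjust = 1.25) +
  # Arrows from the origin to each row of TData
  geom_segment(data = TData,
               aes(xend = .data[[v1]], yend = .data[[v2]]),
               x = 0, y = 0,
               arrow = arrow(angle = 25, length = unit(0.25, "cm"))) +
  # Labels for TData, with x and y mapped explicitly
  geom_text(data = TData,
            aes(x = .data[[v1]], y = .data[[v2]], label = Tlabel),
            colour = "blue", vjust = 1.35) +
  labs(x = v1, y = v2) +
  theme_bw()
print(p)
```

Because .data always refers to the data of the layer being drawn, the same mapping works for both AData and TData as long as both contain the columns named in v1 and v2.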

geom_polygon to draw normal and logistic distributions

UPDATE:
I have solved my problem. I was looking for
coord_cartesian(xlim = c(800, 2100), ylim = c(0, 0.0021))
Thanks to everyone who tried to help!
QUESTION WAS:
I would like to draw a nice picture of what is the difference between normal and logistic distributions. I have reached that point :
x=seq(1000,2000,length=200)
dat <- data.frame(
norm = dnorm(x,mean=1500,sd=200),
logistic = dlogis(x,location=1500,scale=200), x = x
)
ggplot(data=dat, aes(x=x)) +
geom_polygon(aes(y=norm), fill="red", alpha=0.6) +
geom_polygon(aes(y=logistic), fill="blue", alpha=0.6) +
xlab("") + ylab("") +
ggtitle("Logistic and Normal Distributions") + # opts() has been removed from ggplot2
scale_x_continuous(expand = c(0, 0)) +
scale_y_continuous(expand = c(0, 0))
However the logistic one is "cut" at the bottom. I think what I should do is to draw this distribution from 0 to 3000 for example but show only 1000-2000.
Any clues how to do this?
I tried scale_x_continuous(limits = c(1000, 2000)) but this does not work
UPDATE:
I have updated my code so I have legend, now it looks like this:
x=seq(700,2300,length=200)
dat2 <- data.frame(x=x)
dat2$value <- dnorm(x,mean=1500,sd=200)
dat2$type <- "Normal"
dat1 <- data.frame(x=x)
dat1$value <- dlogis(x,location=1500,scale=200)
dat1$type <- "Logistic"
dat <- rbind(dat1, dat2)
ggplot(data=dat, aes(x=x, y=value, colour=type, fill=type)) + geom_polygon(alpha=0.6) + scale_y_continuous(expand = c(0, 0))
I would draw it using z-scores, from [-2 ; +2]. This has the side benefit that your problem goes away.
x=seq(-2,2,length=200)
dat <- data.frame(
norm = dnorm(x,mean=0,sd=0.2),
logistic = dlogis(x,location=0,scale=0.2), x = x
)
p <- ggplot(data=dat, aes(x=x)) +
geom_polygon(aes(y=norm), fill="red", alpha=0.6) +
geom_polygon(aes(y=logistic), fill="blue", alpha=0.6) +
xlab("z") + ylab("") +
scale_x_continuous(expand = c(0, 0)) +
scale_y_continuous(expand = c(0, 0)) +
ggtitle("Logistic and Normal Distributions") # opts() has been removed from ggplot2
print(p)
The reason it cuts off the bottom is because geom_polygon literally draws the polygon consisting of lines connecting the points you give it. So the flat line across the bottom of the distribution is just connecting the first and last value in your data frame. If you want it to extend to the bottom you can add the appropriate points to your data frame:
ggplot(data=dat, aes(x=x)) +
geom_polygon(aes(y=norm), fill="red", alpha=0.6) +
geom_polygon(data = rbind(c(NA,0,1000),dat,c(NA,0,2000)),aes(y=logistic), fill="blue", alpha=0.6) + xlab("") + ylab("") +
ggtitle("Logistic and Normal Distributions")+ # opts() has been removed from ggplot2
scale_x_continuous(expand = c(0, 0)) +
scale_y_continuous(expand = c(0, 0))
Edited for clarity
You can tinker with this to get it to go down only as far as you want by adding points with the right values. For instance, I forced the logistic distribution to fill all the way down to zero. You could make it level with the normal distribution by rbinding the minimum normal density value instead. Also, be careful where you add them in your data frame. geom_polygon will connect the dots in the order they appear. That's why I added one at the beginning of the data frame and one at the end.
Edit 2
Based on your revised code, my solution still works fine:
x=seq(700,2300,length=200)
dat2 <- data.frame(x=x)
dat2$value <- dnorm(x,mean=1500,sd=200)
dat2$type <- "Normal"
dat1 <- data.frame(x=x)
dat1$value <- dlogis(x,location=1500,scale=200)
dat1$type <- "Logistic"
#Append extra points at the top/bottom to
# complete the polygon
dat1 <- rbind(data.frame(x=700,value=0,type = "Logistic"),dat1,
data.frame(x=2300,value=0,type = "Logistic"))
dat <- rbind(dat1, dat2)
ggplot(data=dat, aes(x=x, y=value, colour=type, fill=type)) +
geom_polygon(alpha=0.6) +
scale_y_continuous(expand = c(0, 0))
And personally, I would prefer this over coord_cartesian, since I'm a stickler about starting my axes from zero.
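As an alternative to padding the data frame by hand, geom_area() fills down to zero on its own, so no extra points are needed. This swaps in a different geom from the one used in the answers above, and the sketch assumes a current ggplot2; the zoom to 1000-2000 via coord_cartesian() follows the asker's own solution.

```r
library(ggplot2)

# Compute both densities over a range wider than the one that is shown
x <- seq(700, 2300, length.out = 200)
dat <- rbind(
  data.frame(x = x, value = dnorm(x, mean = 1500, sd = 200), type = "Normal"),
  data.frame(x = x, value = dlogis(x, location = 1500, scale = 200), type = "Logistic")
)

p <- ggplot(dat, aes(x, value, fill = type)) +
  # position = "identity" overlays the two areas instead of stacking them
  geom_area(alpha = 0.6, position = "identity") +
  # Zoom without dropping data, so the shapes are not clipped at the edges
  coord_cartesian(xlim = c(1000, 2000)) +
  ggtitle("Logistic and Normal Distributions")
print(p)
```

The y axis still starts at zero here, which keeps the property preferred in the answer above.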
The solution is to use
+ coord_cartesian(xlim = c(800, 2100), ylim = c(0, 0.0021))
I ran your code, and then analyzed the values of norm and logistic:
Rgames: mystat(dat$logistic)
min max mean median
3.51e-04 1.25e-03 8.46e-04 8.63e-04
sdev skew kurtosis
2.96e-04 -1.33e-01 -1.4
Rgames: mystat(dat$norm)
min max mean median
8.76e-05 1.99e-03 9.83e-04 9.06e-04
sdev skew kurtosis
6.62e-04 1.67e-01 -1.48
So your logistic values are in fact correctly plotted. As the other answers showed, there are preferable ways to create your underlying data.

Resources