Adding stats to a plot (ggplot2) - r

Given a data frame like :
df1= data.frame(x = c(1:50))
df1$val=df1$x*(-0.35)
I used the ggplot2 and added a regression line with the command
t=ggplot(df1, aes(x=val, y=x))+geom_smooth(method=lm) + geom_point()
In order to add the equation and the r value I tried the code from this question Adding Regression Line Equation and R2 on graph
but I am getting the error
Error in terms.formula(formula, data = data) :
'data' argument is of the wrong type
Any ideas on how to fix this?
EDIT
The code I used
my_sts <- function(df1){
m <- lm(df1$x ~ df1$val, df1);
eq <- substitute(italic(y) == a + b %.% italic(x)*","~~italic(r)^2~"="~r2,
list(a = format(coef(m)[1], digits = 2),
b = format(coef(m)[2], digits = 2),
r2 = format(summary(m)$r.squared, digits = 3)))
as.character(as.expression(eq));
}
tgen = t + geom_text(x = -10, y = 50, label = eq(df1), parse = TRUE)

This is copied from a console session. I corrected two things that I thought were errors: 1) as mention in my comment you should not use df1$ in a formula when you have a data argument, and 2) I think you mean to use my_sts(df1)
> df1= data.frame(x = c(1:50))
> df1$val=df1$x*(-0.35)
> my_sts <- function(df1){
+ m <- lm(x ~ val, df1);
+ eq <- substitute(italic(y) == a + b %.% italic(x)*","~~italic(r)^2~"="~r2,
+ list(a = format(coef(m)[1], digits = 2),
+ b = format(coef(m)[2], digits = 2),
+ r2 = format(summary(m)$r.squared, digits = 3)))
+ as.character(as.expression(eq));
+ }
> t=ggplot(df1, aes(x=val, y=x))+geom_smooth(method=lm) + geom_point()
> tgen = t + geom_text(x = -10, y = 50, label = eq(df1), parse = TRUE)
Error in layer(data = data, mapping = mapping, stat = stat, geom = GeomText, :
could not find function "eq"
> tgen = t + geom_text(x = -10, y = 50, label = my_sts(df1), parse = TRUE)
Warning message:
In summary.lm(m) : essentially perfect fit: summary may be unreliable
> print(tgen)
Seems to print well: Note that x an y roles are reversed, hence the coefficient being the inverse of the modeled factor.

Related

Labelling R2 and p value in ggplot?

I am trying to add lm model coefs of two parallel modelling results onto the same ggplot plot. Here is my working example:
library(ggplot2)
set.seed(100)
dat <- data.frame(
x <- rnorm(100, 1),
y <- rnorm(100, 10),
lev <- gl(n = 2, k = 50, labels = letters[1:2])
)
mod1 <- lm(y~x, dat = dat[lev %in% "a", ])
r1 <- paste("R^2==", round(summary(mod1)[[9]], 3))
p1<- paste("p==", round(summary(mod1)[[4]][2, 4], 3), sep= "")
lab1 <- paste(r1, p1, sep =",")
mod2 <- lm(y~x, dat = dat[lev %in% "b", ])
r2 <- paste("R^2==", round(summary(mod2)[[9]], 3))
p2 <- paste("p==", round(summary(mod2)[[4]][2, 4], 3), sep= "")
lab2 <- paste(r2, p2, sep =",")
ggplot(dat, aes(x = x, y = y, col = lev)) + geom_jitter() + geom_smooth(method = "lm") + annotate("text", x = 2, y = 12, label = lab1, parse = T) + annotate("text", x = 10, y = 8, label = lab2, parse = T)
Here is the promot shows:
Error in parse(text = text[[i]]) : <text>:1:12: unexpected ','
1: R^2== 0.008,
Now the problem is that I could label either R2 or p value seperately, but not both of them together. How could I do to put the two results into one single line on the figure?
BTW, any other efficienty way of doing the same thing as my code? I have nine subplots that I want to put into one full plot, and I don't want to add them one by one.
++++++++++++++++++++++++++ Some update ++++++++++++++++++++++++++++++++++
Following #G. Grothendieck 's kind suggestion and idea, I tried to wrap the most repeatative part of the codes into a function, so I could finish all the plot with a few lines. Now the problem is that, whatever I changed the input variables, the output plot are basically the same, except the axis labels. Can anyone explain why? The following is the working code I used:
library(ggplot2)
library(ggpubr)
set.seed(100)
dat <- data.frame(
x = rnorm(100, 1),
y = rnorm(100, 10),
z = rnorm(100, 25),
lev = gl(n = 2, k = 50, labels = letters[1:2])
)
test <- function(dat, x, y){
fmt <- "%s: Adj ~ R^2 == %.3f * ',' ~ {p == %.3f}"
mod1 <- lm(y ~ x, dat, subset = lev == "a")
sum1 <- summary(mod1)
lab1 <- sprintf(fmt, "a", sum1$adj.r.squared, coef(sum1)[2, 4])
mod2 <- lm(y ~ x, dat, subset = lev == "b")
sum2 <- summary(mod2)
lab2 <- sprintf(fmt, "b", sum2$adj.r.squared, coef(sum2)[2, 4])
colors <- 1:2
p <- ggplot(dat, aes(x = x, y = y, col = lev)) +
geom_jitter() +
geom_smooth(method = "lm") +
annotate("text", x = 2, y = c(12, 8), label = c(lab1, lab2),
parse = TRUE, hjust = 0, color = colors) +
scale_color_manual(values = colors)
return(p)
}
ggarrange(test(dat, x, z), test(dat, y, z))
There are several problems here:
x, y and lev are arguments to data.frame so they must be specified using = rather than <-
make use of the subset= argument in lm
use sprintf instead of paste to simplify the specification of labels
label the text strings a and b and make them the same color as the corresponding lines to identify which is which
the formula syntax needs to be corrected. See fmt below.
it would be clearer to use component names and accessor functions of the summary objects where available
use TRUE rather than T because the latter can be overridden if there is a variable called T but TRUE can never be overridden.
use hjust=0 and adjust the x= and y= in annotate to align the two text strings
combine the annotate statements
place the individual terms of the ggplot statement on separate lines for improved readability
This gives:
library(ggplot2)
set.seed(100)
dat <- data.frame(
x = rnorm(100, 1),
y = rnorm(100, 10),
lev = gl(n = 2, k = 50, labels = letters[1:2])
)
fmt <- "%s: Adj ~ R^2 == %.3f * ',' ~ {p == %.3f}"
mod1 <- lm(y ~ x, dat, subset = lev == "a")
sum1 <- summary(mod1)
lab1 <- sprintf(fmt, "a", sum1$adj.r.squared, coef(sum1)[2, 4])
mod2 <- lm(y ~ x, dat, subset = lev == "b")
sum2 <- summary(mod2)
lab2 <- sprintf(fmt, "b", sum2$adj.r.squared, coef(sum2)[2, 4])
colors <- 1:2
ggplot(dat, aes(x = x, y = y, col = lev)) +
geom_jitter() +
geom_smooth(method = "lm") +
annotate("text", x = 2, y = c(12, 8), label = c(lab1, lab2),
parse = TRUE, hjust = 0, color = colors) +
scale_color_manual(values = colors)
Unless I'm misunderstanding your question, the problem's with the parse = T arguments to your annotate calls. I don't think your strings need to be parsed. Try parse = F instead, or just drop the parameter, as the default value seems to be FALSE anyway

Why the lm_eqn way of adding Regression line function did not work in my case?

I ran into a problem when I was writing a function in R.
I want to compare two variables in this function and I want to draw the regression line of the comparison. I would also want to add the information of the regression line, including the equation and the R^2. The lm_eqn way I have already tried and it did not work on my case, here is my code when I try it. I do not know why, please help!
lm_eqn <- function(df){
m <- lm(y ~ x, df);
eq <- substitute(italic(y) == a + b %.% italic(x)*","~~italic(r)^2~"="~r2,
list(a = format(coef(m)[1], digits = 2),
b = format(coef(m)[2], digits = 2),
r2 = format(summary(m)$r.squared, digits = 3)))
as.character(as.expression(eq));
}
compareFunction <- function(my_dataset, var1, var2) {
ggplot(data = my_dataset,
aes(x = my_dataset[[var1]],
y = my_dataset[[var2]])) +
geom_point() +
geom_smooth(method = 'lm', formula = 'y ~ x') +
geom_text(x = 100, y = 100, label = lm_eqn(my_dataset), parse = TRUE)
}
Ok it becomes a bit tedious in the comment.
So first I recommend adding some useful sample data:
x = runif(100)
y = runif(100)+x
df = data.frame(x,y)
Then update your lm_eqn function as follows - I removed the as.character from your return value.
lm_eqn <- function(df){
m <- lm(y ~ x, df);
eq <- substitute(italic(y) == a + b %.% italic(x)*","~~italic(r)^2~"="~r2,
list(a = format(coef(m)[1], digits = 2),
b = format(coef(m)[2], digits = 2),
r2 = format(summary(m)$r.squared, digits = 3)))
as.expression(eq);
}
The compareFunction I would change to use ggtitle:
compareFunction <- function(my_dataset, var1, var2) {
ggplot(data = my_dataset,
aes(x = my_dataset[[var1]],
y = my_dataset[[var2]])) +
geom_point() +
geom_smooth(method = 'lm', formula = 'y ~ x') +
ggtitle(lm_eqn(my_dataset))
}
Then compareFunction(df,"x","y") yields:

R, R², p-value and regression equation

This code gives me a plot with the regression equation and R2: (but i need to mention in which x and y the equation will be (manually)
CORRELATIONP3 <-CORRELATIONP2[product=='a',]
x<-CORRELATIONP3$b
y<-CORRELATIONP3$p
df <- data.frame(x = x)
m <- lm(y ~ x, data = df)
p <- ggplot(data = df, aes(x = x, y = y)) +
scale_x_continuous("b (%)") +
scale_y_continuous("p (%)")+
geom_smooth(method = "lm", formula = y ~ x) +
geom_point()
p
eq <- substitute(italic(y) == a + b %.% italic(x)*","~~italic(r)^2~"="~r2,
list( a = format(coef(m)[1], digits = 4),
b = format(coef(m)[2], digits = 4),
r2 = format(summary(m)$r.squared, digits = 3)))
dftext <- data.frame(x = 3, y = 0.2, eq = as.character(as.expression(eq)))
p + geom_text(aes(label = eq), data = dftext, parse = TRUE)
But, with this code I have R and p-value: And here the information about R and p values fits automatically in the plot, why? I want this in the first one as well.
CORRELATIONP3 <-CORRELATIONP2[product=='a',]
x<-CORRELATIONP3$b
y<-CORRELATIONP3$p
df <- data.frame(x = x)
m <- lm(y ~ x, data = df)
p <- ggplot(data = df, aes(x = x, y = y)) +
scale_x_continuous("b (%)") +
scale_y_continuous("p (%)")+
geom_smooth(method = "lm", formula = y ~ x) +
geom_point()
p
eq <- substitute(italic(r)~"="~rvalue*","~italic(p)~"="~pvalue, list(rvalue = sprintf("%.2f",sign(coef(m)[2])*sqrt(summary(m)$r.squared)), pvalue = format(summary(m)$coefficients[2,4], digits = 3)))
dftext <- data.frame(x = 30, y = 0.4, eq = as.character(as.expression(eq)))
p + geom_text(aes(label = eq), data = dftext, parse = TRUE)
Can you tell me how can I join all the 4 informations in one sigle plot? (R, R2, equation and p-value)
Besides that, i would like that these informations could be fitted automatically in the plot, not manually.
Ok, I am not sure if this works as you have not given a reproducible example of your data but I guess you just have to rename one of your variables e.g.:
eq2 <- substitute(italic(r)~"="~rvalue*","~italic(p)~"="~pvalue,
list(rvalue = sprintf("%.2f",sign(coef(m)[2])*sqrt(summary(m)$r.squared)),
pvalue = format(summary(m)$coefficients[2,4], digits = 3)))
and then you change the points you put it on in your plot just a bit below your other block in the first plot. x and y here refer to the position of the text lable so play around with these until your text looks ok.
dftext2 <- data.frame(x = 30, y = 0.12, eq2 = as.character(as.expression(eq2)))
p + geom_text(aes(label = eq2), data = dftext2, parse = TRUE)
please let me know if this works and if this is what you meant.

How to automate linear regression of multiple rows in a loop and plot using R

I am working with 2 data frames and trying to automate the way I currently do.
ID <- c("ID101","ID102","ID103","ID104","ID105","ID106","ID107","ID108","ID109","ID110")
A <- c(420,440,490,413,446,466,454,433,401,414)
B <- c(230,240,295,253,266,286,254,233,201,214)
C <- c(20,40,90,13,46,66,54,33,61,14)
D <- c(120,140,190,113,146,166,154,133,101,114)
E <- c(38,34,33,56,87,31,12,44,68,91)
F <- c(938,934,973,956,987,931,962,944,918,921)
df1 <- data.frame(ID,A,B,C,D,E,F)
Upstream <- c("A","C","E")
Downstream <- c("B","D","F")
df2 <- data.frame(Upstream,Downstream)
I am currently running a simple linear regression between upstream and downstream data and plot its residuals along with it. The way I do it manually is
fit <- lm(A ~ B, data=df)
lm_eqn <- function(df){
m <- lm(A ~ B, df);
eq <- substitute(italic(y) == a + b %.% italic(x)*","~~italic(R)^2~"="~r2* "," ~~ RMSE ~"="~rmse,
list(a = format(coef(m)[1], digits = 2),
b = format(coef(m)[2], digits = 2),
r2 = format(summary(m)$r.squared, digits = 3),
rmse = round(sqrt(mean(resid(m)^2,na.rm=TRUE)), 3)))
as.character(as.expression(eq));
}
library(ggplot2)
library(grid)
library(gridExtra)
p1 <- ggplot(df, aes(x=A, y=B)) + geom_point(colour="red",size = 3) + geom_smooth(method=lm) + geom_text(aes(size=10),x = -Inf, hjust = -1, y = Inf, vjust = 1, label = lm_eqn(df), parse = TRUE,show.legend = F)
p2 <- ggplot(df, aes(x=B, y=resid(fit))) + ylab("Residuals") + geom_point(shape=1,colour="red",size = 3) + geom_smooth(method = "lm")
grid.arrange(p1, p2, ncol=2,top=textGrob("Regression data",
gp=gpar(cex=1.5, fontface="bold")))
I get this plot
I redo this manually for the next row in df2 which is C & D and then manually change the parameters again for the next row which is E & F.
How can I use functions or automate this logic so that I run only one time and get the 3 plots, one for each (A&B), (C&D), (E&F).
Please let me know if I am not clear on what I want. Ideally I am looking a way to code up so that I don't manually need to enter the values (A,B,C,D,E,F) at the respective places every time I run. Kindly please provide some directions on how to solve this.
You can use apply() on each df2s row, using as.formula() and aes_string():
apply(df2, 1, function(d)
{
fit <- lm(as.formula(paste(d["Upstream"], " ~ ", d["Downstream"])), data=df1)
lm_eqn <- function(df){
m <- lm(as.formula(paste(d["Upstream"], " ~ ", d["Downstream"])), df);
eq <- substitute(italic(y) == a + b %.% italic(x)*","~~italic(R)^2~"="~r2* "," ~~ RMSE ~"="~rmse,
list(a = format(coef(m)[1], digits = 2),
b = format(coef(m)[2], digits = 2),
r2 = format(summary(m)$r.squared, digits = 3),
rmse = round(sqrt(mean(resid(m)^2,na.rm=TRUE)), 3)))
as.character(as.expression(eq));
}
library(ggplot2)
library(grid)
library(gridExtra)
p1 <- ggplot(df1, aes_string(x=d["Upstream"], y=d["Downstream"])) + geom_point(colour="red",size = 3) + geom_smooth(method=lm) + geom_text(aes(size=10),x = -Inf, hjust = -1, y = Inf, vjust = 1, label = lm_eqn(df1), parse = TRUE,show.legend = FALSE)
p2 <- ggplot(df1, aes_string(x=d["Downstream"], y=resid(fit))) + ylab("Residuals") + geom_point(shape=1,colour="red",size = 3) + geom_smooth(method = "lm")
grid.arrange(p1, p2, ncol=2,top=textGrob("Regression data",
gp=gpar(cex=1.5, fontface="bold")))
})
Another answer using reshape2 to organize your data and plyr to run the regression and plots on the data subsets:
library(reshape2)
df3 <- cbind(
#melt(data, id.vars, measure.vars, variable.name, value.name)
melt(df1, c("ID"), df2$Upstream, "up", "Independent.var"),
melt(df1, c("ID"), df2$Downstream, "down", "Dependent.var")
)
#df3 #Results of the above cbind(melt, melt)
# ID up value ID down Dependent.var
#1 ID101 A 420 ID101 B 230
#2 ID102 A 440 ID102 B 240
#3 ID103 A 490 ID103 B 295
# . . . . . .
#28 ID108 E 44 ID108 F 944
#29 ID109 E 68 ID109 F 918
#30 ID110 E 91 ID110 F 921
#Small edit to the labeling function:
lm_eqn <- function(df){
m <- lm(Dependent.var ~ Independent.var, df); #This is the only change
eq <- substitute(italic(y) == a + b %.% italic(x)*","~~italic(R)^2~"="~r2* "," ~~ RMSE ~"="~rmse,
list(a = format(coef(m)[1], digits = 2),
b = format(coef(m)[2], digits = 2),
r2 = format(summary(m)$r.squared, digits = 3),
rmse = round(sqrt(mean(resid(m)^2,na.rm=TRUE)), 3)))
as.character(as.expression(eq));
}
# Put your plot code into a function.
plotter <- function(zz) {
zz$resid <- resid(lm(Dependent.var ~ Independent.var, zz))
p1 <- ggplot(zz, aes(x= Independent.var, y= Dependent.var)) +
geom_point(colour="red",size = 3) + geom_smooth(method=lm) +
geom_text( aes(size=10),x = -Inf, hjust = -2, y = Inf, vjust = 1, label = lm_eqn(zz), parse = TRUE,show.legend = FALSE)
p2 <- ggplot(zz, aes(x= Dependent.var, y=resid )) + ylab("Residuals") +
geom_point(shape=1,colour="red",size = 3) + geom_smooth(method = "lm")
p3 <- grid.arrange(p1, p2, ncol=2,top=textGrob("Regression data",
gp=gpar(cex=1.5, fontface="bold")))
#Choose what you want to output here. You can output a list: ex. list(fit, graph, etc)
p3 #Only the last plot is returned in this case
}
library(plyr)
#Run on every subset of data:
#dlply = take input (d)ataframe and output a (l)ist using (ply)r
dlply(df3, .variables = c("up"), .fun = plotter)
#p3 is output in list format

ggplot in R: add regression equation in a plot

I saw this answer from Jayden a while ago about adding regression equation to a plot, which I found very useful. But I don't want to display R^2, so I changed the code a bit to this:
lm_eqn = function(m) {
l <- list(a = format(coef(m)[1], digits = 2),
b = format(abs(coef(m)[2]), digits = 2));
if (coef(m)[2] >= 0) {
eq <- substitute(italic(y) == a + b %.% italic(x))
} else {
eq <- substitute(italic(y) == a - b %.% italic(x))
}
as.character(as.expression(eq));
}
This managed to plot "a+bx" or "a-bx" to the plot, but without actual coefficients replacing a and b. Does anyone know how to fix the problem? Thanks very much!
Jayden's answer:
lm_eqn = function(m) {
l <- list(a = format(coef(m)[1], digits = 2),
b = format(abs(coef(m)[2]), digits = 2),
r2 = format(summary(m)$r.squared, digits = 3));
if (coef(m)[2] >= 0) {
eq <- substitute(italic(y) == a + b %.% italic(x)*","~~italic(r)^2~"="~r2,l)
} else {
eq <- substitute(italic(y) == a - b %.% italic(x)*","~~italic(r)^2~"="~r2,l)
}
as.character(as.expression(eq));
}
It looks like you are missing the l in substitute(). That is, use substitute(yourFormula, l). Here's a MWE without the r^2 that parallels the one you're looking at (which I think is at Adding Regression Line Equation and R2 on graph).
library(ggplot2)
# Function to generate correlated data.
GenCorrData = function(mu, Sig, n = 1000) {
U <- chol(Sig)
Z <- matrix(rnorm(n*length(mu)), nrow = length(mu))
Y <- crossprod(U,Z) + mu
Y <- as.data.frame(t(Y))
names(Y) <- c("x", "y")
return(Y)
}
# Function to add text
LinEqn = function(m) {
l <- list(a = format(coef(m)[1], digits = 2),
b = format(abs(coef(m)[2]), digits = 2));
if (coef(m)[2] >= 0) {
eq <- substitute(italic(y) == a + b %.% italic(x),l)
} else {
eq <- substitute(italic(y) == a - b %.% italic(x),l)
}
as.character(as.expression(eq));
}
# Example
set.seed(700)
n1 <- 1000
mu1 <- c(4, 5)
Sig1 <- matrix(c(1, .8, .8, 1), nrow = length(mu1))
df1 <- GenCorrData(mu1, Sig1, n1)
scatter1 <- ggplot(data = df1, aes(x, y)) +
geom_point(shape = 21, color = "blue", size = 3.5) +
scale_x_continuous(expand = c(0, 0), limits = c(0, 8)) +
scale_y_continuous(expand = c(0, 0), limits = c(0, 8))
scatter.line1 <- scatter1 +
geom_smooth(method = "lm", formula = y ~ x, se = FALSE,
color="black", size = 1) +
annotate("text", x = 2, y = 7, color = "black", size = 5,
label = LinEqn(lm(y ~ x, df1)), parse = TRUE)
scatter.line1

Resources