Related
I am trying to add lm model coefs of two parallel modelling results onto the same ggplot plot. Here is my working example:
library(ggplot2)
set.seed(100)
dat <- data.frame(
x <- rnorm(100, 1),
y <- rnorm(100, 10),
lev <- gl(n = 2, k = 50, labels = letters[1:2])
)
mod1 <- lm(y~x, dat = dat[lev %in% "a", ])
r1 <- paste("R^2==", round(summary(mod1)[[9]], 3))
p1<- paste("p==", round(summary(mod1)[[4]][2, 4], 3), sep= "")
lab1 <- paste(r1, p1, sep =",")
mod2 <- lm(y~x, dat = dat[lev %in% "b", ])
r2 <- paste("R^2==", round(summary(mod2)[[9]], 3))
p2 <- paste("p==", round(summary(mod2)[[4]][2, 4], 3), sep= "")
lab2 <- paste(r2, p2, sep =",")
ggplot(dat, aes(x = x, y = y, col = lev)) + geom_jitter() + geom_smooth(method = "lm") + annotate("text", x = 2, y = 12, label = lab1, parse = T) + annotate("text", x = 10, y = 8, label = lab2, parse = T)
Here is the promot shows:
Error in parse(text = text[[i]]) : <text>:1:12: unexpected ','
1: R^2== 0.008,
Now the problem is that I could label either R2 or p value seperately, but not both of them together. How could I do to put the two results into one single line on the figure?
BTW, any other efficienty way of doing the same thing as my code? I have nine subplots that I want to put into one full plot, and I don't want to add them one by one.
++++++++++++++++++++++++++ Some update ++++++++++++++++++++++++++++++++++
Following #G. Grothendieck 's kind suggestion and idea, I tried to wrap the most repeatative part of the codes into a function, so I could finish all the plot with a few lines. Now the problem is that, whatever I changed the input variables, the output plot are basically the same, except the axis labels. Can anyone explain why? The following is the working code I used:
library(ggplot2)
library(ggpubr)
set.seed(100)
dat <- data.frame(
x = rnorm(100, 1),
y = rnorm(100, 10),
z = rnorm(100, 25),
lev = gl(n = 2, k = 50, labels = letters[1:2])
)
test <- function(dat, x, y){
fmt <- "%s: Adj ~ R^2 == %.3f * ',' ~ {p == %.3f}"
mod1 <- lm(y ~ x, dat, subset = lev == "a")
sum1 <- summary(mod1)
lab1 <- sprintf(fmt, "a", sum1$adj.r.squared, coef(sum1)[2, 4])
mod2 <- lm(y ~ x, dat, subset = lev == "b")
sum2 <- summary(mod2)
lab2 <- sprintf(fmt, "b", sum2$adj.r.squared, coef(sum2)[2, 4])
colors <- 1:2
p <- ggplot(dat, aes(x = x, y = y, col = lev)) +
geom_jitter() +
geom_smooth(method = "lm") +
annotate("text", x = 2, y = c(12, 8), label = c(lab1, lab2),
parse = TRUE, hjust = 0, color = colors) +
scale_color_manual(values = colors)
return(p)
}
ggarrange(test(dat, x, z), test(dat, y, z))
There are several problems here:
x, y and lev are arguments to data.frame so they must be specified using = rather than <-
make use of the subset= argument in lm
use sprintf instead of paste to simplify the specification of labels
label the text strings a and b and make them the same color as the corresponding lines to identify which is which
the formula syntax needs to be corrected. See fmt below.
it would be clearer to use component names and accessor functions of the summary objects where available
use TRUE rather than T because the latter can be overridden if there is a variable called T but TRUE can never be overridden.
use hjust=0 and adjust the x= and y= in annotate to align the two text strings
combine the annotate statements
place the individual terms of the ggplot statement on separate lines for improved readability
This gives:
library(ggplot2)
set.seed(100)
dat <- data.frame(
x = rnorm(100, 1),
y = rnorm(100, 10),
lev = gl(n = 2, k = 50, labels = letters[1:2])
)
fmt <- "%s: Adj ~ R^2 == %.3f * ',' ~ {p == %.3f}"
mod1 <- lm(y ~ x, dat, subset = lev == "a")
sum1 <- summary(mod1)
lab1 <- sprintf(fmt, "a", sum1$adj.r.squared, coef(sum1)[2, 4])
mod2 <- lm(y ~ x, dat, subset = lev == "b")
sum2 <- summary(mod2)
lab2 <- sprintf(fmt, "b", sum2$adj.r.squared, coef(sum2)[2, 4])
colors <- 1:2
ggplot(dat, aes(x = x, y = y, col = lev)) +
geom_jitter() +
geom_smooth(method = "lm") +
annotate("text", x = 2, y = c(12, 8), label = c(lab1, lab2),
parse = TRUE, hjust = 0, color = colors) +
scale_color_manual(values = colors)
Unless I'm misunderstanding your question, the problem's with the parse = T arguments to your annotate calls. I don't think your strings need to be parsed. Try parse = F instead, or just drop the parameter, as the default value seems to be FALSE anyway
I ran into a problem when I was writing a function in R.
I want to compare two variables in this function and I want to draw the regression line of the comparison. I would also want to add the information of the regression line, including the equation and the R^2. The lm_eqn way I have already tried and it did not work on my case, here is my code when I try it. I do not know why, please help!
lm_eqn <- function(df){
m <- lm(y ~ x, df);
eq <- substitute(italic(y) == a + b %.% italic(x)*","~~italic(r)^2~"="~r2,
list(a = format(coef(m)[1], digits = 2),
b = format(coef(m)[2], digits = 2),
r2 = format(summary(m)$r.squared, digits = 3)))
as.character(as.expression(eq));
}
compareFunction <- function(my_dataset, var1, var2) {
ggplot(data = my_dataset,
aes(x = my_dataset[[var1]],
y = my_dataset[[var2]])) +
geom_point() +
geom_smooth(method = 'lm', formula = 'y ~ x') +
geom_text(x = 100, y = 100, label = lm_eqn(my_dataset), parse = TRUE)
}
Ok it becomes a bit tedious in the comment.
So first I recommend adding some useful sample data:
x = runif(100)
y = runif(100)+x
df = data.frame(x,y)
Then update your lm_eqn function as follows - I removed the as.character from your return value.
lm_eqn <- function(df){
m <- lm(y ~ x, df);
eq <- substitute(italic(y) == a + b %.% italic(x)*","~~italic(r)^2~"="~r2,
list(a = format(coef(m)[1], digits = 2),
b = format(coef(m)[2], digits = 2),
r2 = format(summary(m)$r.squared, digits = 3)))
as.expression(eq);
}
The compareFunction I would change to use ggtitle:
compareFunction <- function(my_dataset, var1, var2) {
ggplot(data = my_dataset,
aes(x = my_dataset[[var1]],
y = my_dataset[[var2]])) +
geom_point() +
geom_smooth(method = 'lm', formula = 'y ~ x') +
ggtitle(lm_eqn(my_dataset))
}
Then compareFunction(df,"x","y") yields:
This code gives me a plot with the regression equation and R2: (but i need to mention in which x and y the equation will be (manually)
CORRELATIONP3 <-CORRELATIONP2[product=='a',]
x<-CORRELATIONP3$b
y<-CORRELATIONP3$p
df <- data.frame(x = x)
m <- lm(y ~ x, data = df)
p <- ggplot(data = df, aes(x = x, y = y)) +
scale_x_continuous("b (%)") +
scale_y_continuous("p (%)")+
geom_smooth(method = "lm", formula = y ~ x) +
geom_point()
p
eq <- substitute(italic(y) == a + b %.% italic(x)*","~~italic(r)^2~"="~r2,
list( a = format(coef(m)[1], digits = 4),
b = format(coef(m)[2], digits = 4),
r2 = format(summary(m)$r.squared, digits = 3)))
dftext <- data.frame(x = 3, y = 0.2, eq = as.character(as.expression(eq)))
p + geom_text(aes(label = eq), data = dftext, parse = TRUE)
But, with this code I have R and p-value: And here the information about R and p values fits automatically in the plot, why? I want this in the first one as well.
CORRELATIONP3 <-CORRELATIONP2[product=='a',]
x<-CORRELATIONP3$b
y<-CORRELATIONP3$p
df <- data.frame(x = x)
m <- lm(y ~ x, data = df)
p <- ggplot(data = df, aes(x = x, y = y)) +
scale_x_continuous("b (%)") +
scale_y_continuous("p (%)")+
geom_smooth(method = "lm", formula = y ~ x) +
geom_point()
p
eq <- substitute(italic(r)~"="~rvalue*","~italic(p)~"="~pvalue, list(rvalue = sprintf("%.2f",sign(coef(m)[2])*sqrt(summary(m)$r.squared)), pvalue = format(summary(m)$coefficients[2,4], digits = 3)))
dftext <- data.frame(x = 30, y = 0.4, eq = as.character(as.expression(eq)))
p + geom_text(aes(label = eq), data = dftext, parse = TRUE)
Can you tell me how can I join all the 4 informations in one sigle plot? (R, R2, equation and p-value)
Besides that, i would like that these informations could be fitted automatically in the plot, not manually.
Ok, I am not sure if this works as you have not given a reproducible example of your data but I guess you just have to rename one of your variables e.g.:
eq2 <- substitute(italic(r)~"="~rvalue*","~italic(p)~"="~pvalue,
list(rvalue = sprintf("%.2f",sign(coef(m)[2])*sqrt(summary(m)$r.squared)),
pvalue = format(summary(m)$coefficients[2,4], digits = 3)))
and then you change the points you put it on in your plot just a bit below your other block in the first plot. x and y here refer to the position of the text lable so play around with these until your text looks ok.
dftext2 <- data.frame(x = 30, y = 0.12, eq2 = as.character(as.expression(eq2)))
p + geom_text(aes(label = eq2), data = dftext2, parse = TRUE)
please let me know if this works and if this is what you meant.
I am working with 2 data frames and trying to automate the way I currently do.
ID <- c("ID101","ID102","ID103","ID104","ID105","ID106","ID107","ID108","ID109","ID110")
A <- c(420,440,490,413,446,466,454,433,401,414)
B <- c(230,240,295,253,266,286,254,233,201,214)
C <- c(20,40,90,13,46,66,54,33,61,14)
D <- c(120,140,190,113,146,166,154,133,101,114)
E <- c(38,34,33,56,87,31,12,44,68,91)
F <- c(938,934,973,956,987,931,962,944,918,921)
df1 <- data.frame(ID,A,B,C,D,E,F)
Upstream <- c("A","C","E")
Downstream <- c("B","D","F")
df2 <- data.frame(Upstream,Downstream)
I am currently running a simple linear regression between upstream and downstream data and plot its residuals along with it. The way I do it manually is
fit <- lm(A ~ B, data=df)
lm_eqn <- function(df){
m <- lm(A ~ B, df);
eq <- substitute(italic(y) == a + b %.% italic(x)*","~~italic(R)^2~"="~r2* "," ~~ RMSE ~"="~rmse,
list(a = format(coef(m)[1], digits = 2),
b = format(coef(m)[2], digits = 2),
r2 = format(summary(m)$r.squared, digits = 3),
rmse = round(sqrt(mean(resid(m)^2,na.rm=TRUE)), 3)))
as.character(as.expression(eq));
}
library(ggplot2)
library(grid)
library(gridExtra)
p1 <- ggplot(df, aes(x=A, y=B)) + geom_point(colour="red",size = 3) + geom_smooth(method=lm) + geom_text(aes(size=10),x = -Inf, hjust = -1, y = Inf, vjust = 1, label = lm_eqn(df), parse = TRUE,show.legend = F)
p2 <- ggplot(df, aes(x=B, y=resid(fit))) + ylab("Residuals") + geom_point(shape=1,colour="red",size = 3) + geom_smooth(method = "lm")
grid.arrange(p1, p2, ncol=2,top=textGrob("Regression data",
gp=gpar(cex=1.5, fontface="bold")))
I get this plot
I redo this manually for the next row in df2 which is C & D and then manually change the parameters again for the next row which is E & F.
How can I use functions or automate this logic so that I run only one time and get the 3 plots, one for each (A&B), (C&D), (E&F).
Please let me know if I am not clear on what I want. Ideally I am looking a way to code up so that I don't manually need to enter the values (A,B,C,D,E,F) at the respective places every time I run. Kindly please provide some directions on how to solve this.
You can use apply() on each df2s row, using as.formula() and aes_string():
apply(df2, 1, function(d)
{
fit <- lm(as.formula(paste(d["Upstream"], " ~ ", d["Downstream"])), data=df1)
lm_eqn <- function(df){
m <- lm(as.formula(paste(d["Upstream"], " ~ ", d["Downstream"])), df);
eq <- substitute(italic(y) == a + b %.% italic(x)*","~~italic(R)^2~"="~r2* "," ~~ RMSE ~"="~rmse,
list(a = format(coef(m)[1], digits = 2),
b = format(coef(m)[2], digits = 2),
r2 = format(summary(m)$r.squared, digits = 3),
rmse = round(sqrt(mean(resid(m)^2,na.rm=TRUE)), 3)))
as.character(as.expression(eq));
}
library(ggplot2)
library(grid)
library(gridExtra)
p1 <- ggplot(df1, aes_string(x=d["Upstream"], y=d["Downstream"])) + geom_point(colour="red",size = 3) + geom_smooth(method=lm) + geom_text(aes(size=10),x = -Inf, hjust = -1, y = Inf, vjust = 1, label = lm_eqn(df1), parse = TRUE,show.legend = FALSE)
p2 <- ggplot(df1, aes_string(x=d["Downstream"], y=resid(fit))) + ylab("Residuals") + geom_point(shape=1,colour="red",size = 3) + geom_smooth(method = "lm")
grid.arrange(p1, p2, ncol=2,top=textGrob("Regression data",
gp=gpar(cex=1.5, fontface="bold")))
})
Another answer using reshape2 to organize your data and plyr to run the regression and plots on the data subsets:
library(reshape2)
df3 <- cbind(
#melt(data, id.vars, measure.vars, variable.name, value.name)
melt(df1, c("ID"), df2$Upstream, "up", "Independent.var"),
melt(df1, c("ID"), df2$Downstream, "down", "Dependent.var")
)
#df3 #Results of the above cbind(melt, melt)
# ID up value ID down Dependent.var
#1 ID101 A 420 ID101 B 230
#2 ID102 A 440 ID102 B 240
#3 ID103 A 490 ID103 B 295
# . . . . . .
#28 ID108 E 44 ID108 F 944
#29 ID109 E 68 ID109 F 918
#30 ID110 E 91 ID110 F 921
#Small edit to the labeling function:
lm_eqn <- function(df){
m <- lm(Dependent.var ~ Independent.var, df); #This is the only change
eq <- substitute(italic(y) == a + b %.% italic(x)*","~~italic(R)^2~"="~r2* "," ~~ RMSE ~"="~rmse,
list(a = format(coef(m)[1], digits = 2),
b = format(coef(m)[2], digits = 2),
r2 = format(summary(m)$r.squared, digits = 3),
rmse = round(sqrt(mean(resid(m)^2,na.rm=TRUE)), 3)))
as.character(as.expression(eq));
}
# Put your plot code into a function.
plotter <- function(zz) {
zz$resid <- resid(lm(Dependent.var ~ Independent.var, zz))
p1 <- ggplot(zz, aes(x= Independent.var, y= Dependent.var)) +
geom_point(colour="red",size = 3) + geom_smooth(method=lm) +
geom_text( aes(size=10),x = -Inf, hjust = -2, y = Inf, vjust = 1, label = lm_eqn(zz), parse = TRUE,show.legend = FALSE)
p2 <- ggplot(zz, aes(x= Dependent.var, y=resid )) + ylab("Residuals") +
geom_point(shape=1,colour="red",size = 3) + geom_smooth(method = "lm")
p3 <- grid.arrange(p1, p2, ncol=2,top=textGrob("Regression data",
gp=gpar(cex=1.5, fontface="bold")))
#Choose what you want to output here. You can output a list: ex. list(fit, graph, etc)
p3 #Only the last plot is returned in this case
}
library(plyr)
#Run on every subset of data:
#dlply = take input (d)ataframe and output a (l)ist using (ply)r
dlply(df3, .variables = c("up"), .fun = plotter)
#p3 is output in list format
I saw this answer from Jayden a while ago about adding regression equation to a plot, which I found very useful. But I don't want to display R^2, so I changed the code a bit to this:
lm_eqn = function(m) {
l <- list(a = format(coef(m)[1], digits = 2),
b = format(abs(coef(m)[2]), digits = 2));
if (coef(m)[2] >= 0) {
eq <- substitute(italic(y) == a + b %.% italic(x))
} else {
eq <- substitute(italic(y) == a - b %.% italic(x))
}
as.character(as.expression(eq));
}
This managed to plot "a+bx" or "a-bx" to the plot, but without actual coefficients replacing a and b. Does anyone know how to fix the problem? Thanks very much!
Jayden's answer:
lm_eqn = function(m) {
l <- list(a = format(coef(m)[1], digits = 2),
b = format(abs(coef(m)[2]), digits = 2),
r2 = format(summary(m)$r.squared, digits = 3));
if (coef(m)[2] >= 0) {
eq <- substitute(italic(y) == a + b %.% italic(x)*","~~italic(r)^2~"="~r2,l)
} else {
eq <- substitute(italic(y) == a - b %.% italic(x)*","~~italic(r)^2~"="~r2,l)
}
as.character(as.expression(eq));
}
It looks like you are missing the l in substitute(). That is, use substitute(yourFormula, l). Here's a MWE without the r^2 that parallels the one you're looking at (which I think is at Adding Regression Line Equation and R2 on graph).
library(ggplot2)
# Function to generate correlated data.
GenCorrData = function(mu, Sig, n = 1000) {
U <- chol(Sig)
Z <- matrix(rnorm(n*length(mu)), nrow = length(mu))
Y <- crossprod(U,Z) + mu
Y <- as.data.frame(t(Y))
names(Y) <- c("x", "y")
return(Y)
}
# Function to add text
LinEqn = function(m) {
l <- list(a = format(coef(m)[1], digits = 2),
b = format(abs(coef(m)[2]), digits = 2));
if (coef(m)[2] >= 0) {
eq <- substitute(italic(y) == a + b %.% italic(x),l)
} else {
eq <- substitute(italic(y) == a - b %.% italic(x),l)
}
as.character(as.expression(eq));
}
# Example
set.seed(700)
n1 <- 1000
mu1 <- c(4, 5)
Sig1 <- matrix(c(1, .8, .8, 1), nrow = length(mu1))
df1 <- GenCorrData(mu1, Sig1, n1)
scatter1 <- ggplot(data = df1, aes(x, y)) +
geom_point(shape = 21, color = "blue", size = 3.5) +
scale_x_continuous(expand = c(0, 0), limits = c(0, 8)) +
scale_y_continuous(expand = c(0, 0), limits = c(0, 8))
scatter.line1 <- scatter1 +
geom_smooth(method = "lm", formula = y ~ x, se = FALSE,
color="black", size = 1) +
annotate("text", x = 2, y = 7, color = "black", size = 5,
label = LinEqn(lm(y ~ x, df1)), parse = TRUE)
scatter.line1