Visualize test and training set distribution with ggplot2

I am trying to visualize the distribution of a dataset and its splits into test and training data, to check whether the split is stratified.
The minimal example uses the iris dataset, which has a Species column: a factor with 3 levels. The following code snippet shows a nice plot with the count for each label. However, I would like to see the percentage/probability of each label within the respective set, so that I can compare the distributions of the training and test sets.
library("tidyverse")
data(iris)
n = nrow(iris)
idxTrain <- sample(1:n, size = round(0.7*n), replace = F)
train <- iris[idxTrain,]
test <- iris[-idxTrain,]
iris$Set <- rep("Train", time = nrow(iris))
iris$Set[-idxTrain] <- "Test"
ggplot(iris, aes(x = Species, fill = Set)) + geom_bar(position = "dodge")
I tried calculating the percentage as shown below. However, this does not work, because it shows each bar as a percentage of the whole data frame, which gives a distribution similar to the counts.
geom_bar(aes(y = (..count..)/sum(..count..)))
How can I plot the percentage of each label within each set efficiently?
Bonus: Including the whole dataset, train and test.

library("tidyverse")
data(iris)
n = nrow(iris)
idxTrain <- sample(1:n, size = round(0.7*n), replace = F)
train <- iris[idxTrain,]
test <- iris[-idxTrain,]
iris$Set <- rep("Train", time = nrow(iris))
iris$Set[-idxTrain] <- "Test"
You need a separate data frame for the labels:
df_labs <-
iris %>%
group_by(Species) %>%
count(Set) %>%
mutate(pct = n / sum(n)) %>%
filter(Set == "Test")
You then use it as the data for the label geom (or the text geom):
ggplot(iris, aes(x = Species, fill = Set)) +
geom_bar(position = "dodge") +
geom_label(data = df_labs, aes(label = scales::percent(pct), y = n / 2))
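The labels above give the test share within each species. To compare the label distribution within each set, which is what the question asks for, one option is to compute the shares per set before plotting. The following is only a sketch: the seed, the "All" level and the names iris_all and df_pct are assumptions introduced here for illustration.

```r
library(dplyr)
library(ggplot2)

set.seed(1)  # assumed seed, only so that the split is reproducible
data(iris)
n <- nrow(iris)
idxTrain <- sample(seq_len(n), size = round(0.7 * n))
iris$Set <- "Test"
iris$Set[idxTrain] <- "Train"

# Stack a copy of the full data, labelled "All", under the train/test rows
iris_all <- bind_rows(iris, mutate(iris, Set = "All"))

# Share of each species within each set (shares sum to 1 per set)
df_pct <- iris_all %>%
  count(Set, Species) %>%
  group_by(Set) %>%
  mutate(pct = n / sum(n)) %>%
  ungroup()

# Bars are pre-computed, so geom_col() is used instead of geom_bar()
p <- ggplot(df_pct, aes(x = Species, y = pct, fill = Set)) +
  geom_col(position = "dodge") +
  scale_y_continuous(labels = scales::percent)
p
```

If the split is close to stratified, the three bars for each species should have similar heights.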

Related

Elegant ggplot to report summary data and trend at each time point in an RCT

I am analysing an RCT and I wish to report summary statistics (mean with 95%CI) for a number of variables at three time points stratified by treatment allocation. Below is my code so far which only yields this figure.
set.seed(42)
n <- 100
dat1 <- data.frame(id=1:n,
treat = factor(sample(c('Trt','Ctrl'), n, rep=TRUE, prob=c(.5, .5))),
time = factor("T1"),
outcome1=rbinom(n = 100, size = 1, prob = 0.3),
st=runif(n, min=24, max=60),
qt=runif(n, min=.24, max=.60),
zt=runif(n, min=124, max=360)
)
dat2 <- data.frame(id=1:n,
treat = dat1$treat,
time = factor("T2"),
outcome1=dat1$outcome1,
st=runif(n, min=34, max=80),
qt=runif(n, min=.44, max=.90),
zt=runif(n, min=214, max=460)
)
dat3 <- data.frame(id=1:n,
treat = dat1$treat,
time = factor("T3"),
outcome1=dat1$outcome1,
st=runif(n, min=44, max=90),
qt=runif(n, min=.74, max=1.60),
zt=runif(n, min=324, max=1760)
)
dat <- rbind(dat1,dat2, dat3)
ggplot(dat,aes(x=mean(zt), y=time)) + geom_point(aes(colour=treat)) + coord_flip() + geom_line(aes(colour=treat))
I have three questions
can a line be added connecting T1 to T2 to T3 showing the trend
can the 95%CI for the mean be added to each point without having to calculate a "ymin" and "ymax" for all my response variables
if I have multiple response variables (in this example "st", "qt" and "zt") is there a way to produce these all at one as some sort of facet?

pivot_longer() should do most of what you need. Pivot st, qt, and zt (and whatever other response variables you need) into long format. Here I've put their names in a column called "response_variables" and their values in "value". You can then facet_wrap by response_variables. After grouping and coloring by treat, stat_summary adds the line, the mean, and the confidence interval (via mean_cl_normal, which requires Hmisc). I opted for scales = "free" in facet_wrap; otherwise you won't see much going on, as zt dominates with its larger range.
library(dplyr)
library(ggplot2)
library(Hmisc)
library(tidyr)
dat %>%
pivot_longer(-(1:4), names_to = "response_variables") %>%
ggplot(.,aes(x=value, y=time, group = treat, color = treat)) +
facet_wrap(~response_variables, scales = "free") +
coord_flip() +
stat_summary(fun.data = mean_cl_normal,
geom = "errorbar") +
stat_summary(fun = mean,
geom = "line") +
stat_summary(fun = mean,
geom = "point")
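For reference, the interval that mean_cl_normal draws is the usual t-based 95% confidence interval for the mean. A minimal base R sketch of the same calculation follows; the helper name mean_ci and the simulated vector x are assumptions made for illustration, not part of the answer above.

```r
set.seed(42)
# Stand-in for one response variable at one time point
x <- runif(100, min = 124, max = 360)

# Mean with a t-based confidence interval, returned in the
# y / ymin / ymax layout that stat_summary(fun.data = ...) expects
mean_ci <- function(x, conf = 0.95) {
  x <- x[!is.na(x)]
  n <- length(x)
  half_width <- qt((1 + conf) / 2, df = n - 1) * sd(x) / sqrt(n)
  c(y = mean(x), ymin = mean(x) - half_width, ymax = mean(x) + half_width)
}

ci <- mean_ci(x)
ci
```

The bounds agree with t.test(x)$conf.int, so no ymin or ymax columns need to be computed by hand for each response variable.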

How to specify groups with colors in qqplot()?

I have created a qqplot (with quantiles of beta distribution) from a dataset including two groups. To visualize, which points belong to which group, I would like to color them. I have tried the following:
res <- beta.mle(data$values) #estimate parameters of beta distribution
qqplot(qbeta(ppoints(500),res$param[1], res$param[2]),data$values,
col = data$group,
ylab = "Quantiles of data",
xlab = "Quantiles of Beta Distribution")
the result is shown here:
I have seen solutions specifying a "col" vector for qqnorm; however, this does not seem to work with qqplot, as simply half of the points are colored in either color, regardless of group. Is there a way to fix this?

I simulated some data just to show how to add color in ggplot.
Libraries
library(tidyverse)
# install.packages("Rfast")
Data
#Simulating data from beta distribution
x <- rbeta(n = 1000,shape1 = .5,shape2 = .5)
#Estimating parameters
res <- Rfast::beta.mle(x)
data <-
tibble(
simulated_data = sort(x),
quantile_data = qbeta(ppoints(length(x)),res$param[1], res$param[2])
) %>%
#Creating a group variable using quartiles
mutate(group = cut(x = simulated_data,
quantile(simulated_data,seq(0,1,.25)),
include.lowest = T))
Code
data %>%
# Adding group variable as color
ggplot(aes( x = quantile_data, y = simulated_data, col = group))+
geom_point()
Output

For those who are wondering, how to work with pre-defined groups, this is the code that worked for me:
library(tidyverse)
library(Rfast)
res <- beta.mle(x)
# make sure groups are not numerical
# (else the color scale might turn out continuous)
g <- plyr::mapvalues(g, c("1", "2"), c("Group1", "Group2"))
data <-
tibble(
my_data = sort(x),
quantile_data = qbeta(ppoints(length(x)),res$param[1], res$param[2]),
group = g[order(x)]
)
data %>%
# Adding group variable as color
ggplot(aes( x = quantile_data, y = my_data, col = group))+
geom_point()
result
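The same reordering idea also works in base graphics. qqplot() sorts the data before plotting, so a col vector given in the original row order no longer lines up with the points; reordering the groups by order() of the data restores the match. This is a sketch under assumptions: the data and group labels are simulated, and the true shape parameters are used in place of beta.mle estimates.

```r
set.seed(1)
y <- rbeta(200, shape1 = 2, shape2 = 5)                # simulated data (assumption)
g <- factor(sample(c("Group1", "Group2"), 200, TRUE))  # hypothetical group labels

# Theoretical quantiles, one per observation
theo <- qbeta(ppoints(length(y)), 2, 5)

# Sort the data and carry the group labels along with it
ord <- order(y)
plot(theo, y[ord], col = g[ord],
     xlab = "Quantiles of Beta Distribution",
     ylab = "Quantiles of data")
legend("topleft", legend = levels(g), col = seq_along(levels(g)), pch = 1)
```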

R ggplot2: add kruskal Wallis and pairwise Wilcoxon test to boxplots with multiple groups/subgroups within each group and facet

I am trying to add Kruskal-Wallis and pairwise Wilcoxon tests to the figure to show which groups are significantly different, but I have multiple groups/subgroups within each group and facet, which makes it complicated.
Here is the R code by using iris dataset as an example, the idea is to perform Kruskal.test across different treatments (A, B, C) for different variables (Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) per species, and also wilcox.test pairwise between them:
rm(list=ls(all=TRUE)); cat('\014') # clear workspace
library(tidyverse)
library(ggplot2)
library(viridis)
library(rstatix)
library(ggpubr) # provides compare_means(), stat_pvalue_manual() and ggpar()
data(iris)
iris$treatment <- rep(c("A","B","C"), length(iris$Species)/3)
mydf <- gather(iris,Variable,value,Sepal.Length:Petal.Width)
# change number to create more difference
mydf[mydf$treatment=="B",]$value <- mydf[mydf$treatment=="B",]$value*1.2
#mydf[mydf$treatment=="C",]$value <- mydf[mydf$treatment=="C",]$value+0.3
# do pairwise Wilcoxon test for pairwise comparisons between groups
df_wilcox <- mydf %>%
group_by(Species,Variable) %>%
pairwise_wilcox_test(value ~ treatment) %>%
add_y_position(step.increase = 0.02)
# do Kruskal Wallis test to see whether or not there is statistically significant difference between three or more groups
df_kw <- compare_means(value ~ treatment, mydf, group.by = c("Species","Variable"), method="kruskal")
# plot boxplot with wilcoxon and kruskal test results
P <- ggplot(data=mydf,
aes(x=treatment, y=value, fill=Variable))+
stat_boxplot(geom = "errorbar")+geom_boxplot(outlier.shape = NA)+
facet_wrap(~Species,nrow=1)+
theme_bw()+
theme(axis.text=element_text(size=12),axis.title=element_text(size=16),plot.title=element_text(size=20)) +
theme(strip.text = element_text(size=14))+
scale_fill_viridis(discrete = TRUE) +
guides(fill=guide_legend(title="Variable"))+
stat_pvalue_manual(df_wilcox,color ="Variable",step.group.by="Variable",tip.length = 0,step.increase = 0.02)
#stat_pvalue_manual(df_wilcox,color ="Variable",step.group.by="Variable",tip.length = 0,step.increase = 0.02,hide.ns=T) #hide non-significant
# change legend title and wilcoxon test color
ggpar(P,legend.title = "Wilcoxon test",palette = c("#440154FF","#3B528BFF","#21908CFF","#FDE725FF"))
This produces the following plot:
To improve the figure, I want to:
automatically add the Kruskal test result from 'df_kw' as text to the figure, showing only significant p-values (e.g. KW(petal.length) p = 0.003)
make the Wilcoxon lines between treatments (e.g. "A", "B", "C") for the different variables (e.g. Petal/Sepal Length/Width) look neat (e.g. all on top of the boxplots with consistent line spacing)
make the color of each Wilcoxon test line the same as the color of its boxplot (currently 'ggpar' doesn't always work if I hide non-significant results, because there are then fewer variables with Wilcoxon lines than variables in the plot)
I am stuck here and wondering anyone has a solution? Thank you very much!

I can answer the first part of your question, regarding how to add the p-value labels to the plot automatically. One way to do that is to combine mydf and df_kw so that df_kw includes all of the same columns as mydf. Here I do that using the data.table package, like this:
setDT(mydf); setDT(df_kw) # convert to data.tables by reference
df_kw <- mydf[df_kw, mult = "first", on = c("Variable", "Species"), nomatch=0L] #creates data table with the same columns as mydf
df_kw <- df_kw[df_kw$p < 0.05,] #removes non-significant values
Then you can add the labels automatically using geom_text. I would first generate a numeric vector of y values to position the labels, like this:
y_lab_placement <- c(sort(rep(seq(max(mydf$value)*1.25, by = -0.35, length.out = length(unique(mydf$Variable))),
length(unique(mydf$Species))), decreasing = T)) # creates y values of where to place the labels
y_lab_placement <- y_lab_placement[1:nrow(df_kw)] # adjusts length of placements to the length of significant values
Then I would add this line to your ggplot to add the labels:
geom_text(data = df_kw, aes(x = 2 , y = y_lab_placement, label = c(paste(Variable, "KW p ~" , round(p, 5)))))+ #adds labels to the plot based on your data
Here is your entire code block including these additions.
rm(list=ls(all=TRUE)); cat('\014') # clear workspace
library(tidyverse)
library(ggplot2)
library(viridis)
library(rstatix)
library(ggpubr) # provides compare_means(), stat_pvalue_manual() and ggpar()
library(data.table) # used in creating combined data table
data(iris)
iris$treatment <- rep(c("A","B","C"), length(iris$Species)/3)
mydf <- gather(iris,Variable,value,Sepal.Length:Petal.Width)
# change number to create more difference
mydf[mydf$treatment=="B",]$value <- mydf[mydf$treatment=="B",]$value*1.2
#mydf[mydf$treatment=="C",]$value <- mydf[mydf$treatment=="C",]$value+0.3
# do pairwise Wilcoxon test for pairwise comparisons between groups
df_wilcox <- mydf %>%
group_by(Species,Variable) %>%
pairwise_wilcox_test(value ~ treatment) %>%
add_y_position(step.increase = 0.02)
# do Kruskal Wallis test to see whether or not there is statistically significant difference between three or more groups
df_kw <- compare_means(value ~ treatment, mydf, group.by = c("Species","Variable"), method="kruskal")
setDT(mydf); setDT(df_kw) # convert to data.tables by reference
df_kw <- mydf[df_kw, mult = "first", on = c("Variable", "Species"), nomatch=0L] #creates data table with the same columns as mydf
df_kw <- df_kw[df_kw$p < 0.05,] #removes non-significant values
# plot boxplot with wilcoxon and kruskal test results
y_lab_placement <- c(sort(rep(seq(max(mydf$value)*1.25, by = -0.35, length.out = length(unique(mydf$Variable))),
length(unique(mydf$Species))), decreasing = T)) # creates y values of where to place the labels
y_lab_placement <- y_lab_placement[1:nrow(df_kw)] # adjusts length of placements to the length of significant values
P <- ggplot(data=mydf,
aes(x=treatment, y=value, fill=Variable))+
stat_boxplot(geom = "errorbar")+geom_boxplot(outlier.shape = NA)+
facet_wrap(~Species,nrow=1)+
theme_bw()+
theme(axis.text=element_text(size=12),axis.title=element_text(size=16),plot.title=element_text(size=20)) +
theme(strip.text = element_text(size=14))+
scale_fill_viridis(discrete = TRUE) +
guides(fill=guide_legend(title="Variable"))+
geom_text(data = df_kw, aes(x = 2 , y = y_lab_placement, label = c(paste(Variable, "KW p ~" , round(p, 5)))))+ #adds labels to the plot based on your data
stat_pvalue_manual(df_wilcox,color ="Variable",step.group.by="Variable",tip.length = 0,step.increase = 0.02)
#stat_pvalue_manual(df_wilcox,color ="Variable",step.group.by="Variable",tip.length = 0,step.increase = 0.02,hide.ns=T) #hide non-significant
# change legend title and wilcoxon test color
ggpar(P,legend.title = "Wilcoxon test",palette = c("#440154FF","#3B528BFF","#21908CFF","#FDE725FF"))
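To check which labels should appear, the per-cell Kruskal-Wallis p-values can also be computed with base R alone. The sketch below rebuilds the long data without tidyr and runs one test per Species and Variable combination; the names long, kw and kw_sig and the label format are assumptions chosen to mirror the example in the question.

```r
data(iris)
iris$treatment <- rep(c("A", "B", "C"), length.out = nrow(iris))

# Long format with base R: one row per measurement
vars <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")
long <- data.frame(
  Species   = rep(iris$Species, times = length(vars)),
  treatment = rep(iris$treatment, times = length(vars)),
  Variable  = rep(vars, each = nrow(iris)),
  value     = unlist(iris[vars], use.names = FALSE)
)
# Same artificial shift for treatment B as in the question
long$value[long$treatment == "B"] <- long$value[long$treatment == "B"] * 1.2

# One Kruskal-Wallis test per Species x Variable cell
cells <- split(long, list(long$Species, long$Variable), drop = TRUE)
kw <- do.call(rbind, lapply(cells, function(d) {
  data.frame(Species  = d$Species[1],
             Variable = d$Variable[1],
             p        = kruskal.test(value ~ treatment, data = d)$p.value)
}))

# Label text and the subset that would be drawn on the plot
kw$label <- paste0("KW(", kw$Variable, ") p = ", signif(kw$p, 2))
kw_sig <- kw[kw$p < 0.05, ]
kw_sig
```

A data frame like kw_sig, which keeps the Species column, can be passed as the data of a geom_text layer so that each label lands in its own facet.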

ggplot2 boxplots - How to avoid extra vertical space when there are no significant comparisons?

After many questions on how to make boxplots with facets and significance levels, particularly this and this, I still have one more little problem.
I managed to produce the plot shown below, which is exactly what I want.
The problem I am facing now is when I have very few, or no significant comparisons; in those cases, the whole space dedicated to the brackets showing the significance levels is still preserved, but I want to get rid of it.
Please check this MWE with the iris dataset:
library(reshape2)
library(ggplot2)
library(ggsignif) # provides geom_signif()
data(iris)
iris$treatment <- rep(c("A","B"), length(iris$Species)/2)
mydf <- melt(iris, measure.vars=names(iris)[1:4])
mydf$treatment <- as.factor(mydf$treatment)
mydf$variable <- factor(mydf$variable, levels=sort(levels(mydf$variable)))
mydf$both <- factor(paste(mydf$treatment, mydf$variable), levels=(unique(paste(mydf$treatment, mydf$variable))))
a <- combn(levels(mydf$both), 2, simplify = FALSE)#this 6 times, for each lipid class
b <- levels(mydf$Species)
CNb <- relist(
paste(unlist(a), rep(b, each=sum(lengths(a)))),
rep.int(a, length(b))
)
CNb
CNb2 <- data.frame(matrix(unlist(CNb), ncol=2, byrow=T))
CNb2
#new p.values
pv.df <- data.frame()
for (gr in unique(mydf$Species)){
for (i in 1:length(a)){
tis <- a[[i]] #variable pair to test
as <- subset(mydf, Species==gr & both %in% tis)
pv <- wilcox.test(value ~ both, data=as)$p.value
ddd <- data.table::data.table(as) # data.table is not attached above, so it is called via its namespace
asm <- as.data.frame(ddd[, list(value=mean(value)), by=list(both=both)])
asm2 <- dcast(asm, .~both, value.var="value")[,-1]
pf <- data.frame(group1=paste(tis[1], gr), group2=paste(tis[2], gr), mean.group1=asm2[,1], mean.group2=asm2[,2], log.FC.1over2=log2(asm2[,1]/asm2[,2]), p.value=pv)
pv.df <- rbind(pv.df, pf)
}
}
pv.df$p.adjust <- p.adjust(pv.df$p.value, method="BH")
colnames(CNb2) <- colnames(pv.df)[1:2]
# merge with the CN list
pv.final <- merge(CNb2, pv.df, by.x = c("group1", "group2"), by.y = c("group1", "group2"))
# fix ordering
pv.final <- pv.final[match(paste(CNb2$group1, CNb2$group2), paste(pv.final$group1, pv.final$group2)),]
# set signif level
pv.final$map.signif <- ifelse(pv.final$p.adjust > 0.05, "", ifelse(pv.final$p.adjust > 0.01,"*", "**"))
# subset
G <- pv.final$p.adjust <= 0.05
CNb[G]
P <- ggplot(mydf,aes(x=both, y=value)) +
geom_boxplot(aes(fill=Species)) +
facet_grid(~Species, scales="free", space="free_x") +
theme(axis.text.x = element_text(angle=45, hjust=1)) +
geom_signif(test="wilcox.test", comparisons = combn(levels(mydf$both),2, simplify = F),
map_signif_level = F,
vjust=0.5,
textsize=4,
size=0.5,
step_increase = 0.06)
P2 <- ggplot_build(P)
#pv.final$map.signif <- "" #UNCOMMENT THIS LINE TO MOCK A CASE WHERE THERE ARE NO SIGNIFICANT COMPARISONS
#pv.final$map.signif[c(1:42,44:80,82:84)] <- "" #UNCOMMENT THIS LINE TO MOCK A CASE WHERE THERE ARE JUST A COUPLE OF SIGNIFICANT COMPARISONS
P2$data[[2]]$annotation <- rep(pv.final$map.signif, each=3)
# remove non significants
P2$data[[2]] <- P2$data[[2]][P2$data[[2]]$annotation != "",]
# and the final plot
png(filename="test.png", height=800, width=800)
plot(ggplot_gtable(P2))
dev.off()
Which produces this plot:
The plot above is exactly what I want... But I am facing cases where there are no significant comparisons, or very few. In these cases, a lot of vertical space is left empty.
To exemplify those scenarios, we can uncomment the line:
pv.final$map.signif <- "" #UNCOMMENT THIS LINE TO MOCK A CASE WHERE THERE ARE NO SIGNIFICANT COMPARISONS
So when there are no significant comparisons I get this plot:
If we uncomment this other line instead:
pv.final$map.signif[c(1:42,44:80,82:84)] <- "" #UNCOMMENT THIS LINE TO MOCK A CASE WHERE THERE ARE JUST A COUPLE OF SIGNIFICANT COMPARISONS
We are in a case where there are only a couple of significant comparisons, and obtain this plot:
So my question here is:
How to adjust the vertical space to the number of significant comparisons, so no vertical space is left there?
There might be something I could change in step_increase or in y_position inside geom_signif(), so I only leave space for the significant comparisons in CNb[G]...

One option is to pre-calculate the p-values for each combination of both levels and then select only the significant ones for plotting. Since we then know up front how many are significant, we can adjust the y-ranges of the plots to account for that. However, it doesn't look like geom_signif is capable of doing only within-facet calculations for the p-value annotations (see the help for the manual argument). Thus, instead of using ggplot's faceting, we instead use lapply to create a separate plot for each Species and then use grid.arrange from the gridExtra package to lay out the individual plots as if they were faceted.
(To respond to the comments, I want to emphasize that the plots are all still created with ggplot2, but we create what would have been the three facet panels of a single plot as three separate plots and then lay them out together as if they had been faceted.)
The function below is hard-coded for the data frame and column names in the OP, but can of course be generalized to take any data frame and column names.
library(gridExtra)
library(tidyverse)
library(ggsignif) # provides geom_signif()
# Change data to reduce number of statistically significant differences
set.seed(2)
df = mydf %>% mutate(value=rnorm(nrow(mydf)))
# Function to generate and lay out the plots
signif_plot = function(signif.cutoff=0.05, height.factor=0.23) {
# Get full range of y-values
y_rng = range(df$value)
# Generate a list of three plots, one for each Species (these are the facets)
plot_list = lapply(split(df, df$Species), function(d) {
# Get pairs of x-values for current facet
pairs = combn(sort(as.character(unique(d$both))), 2, simplify=FALSE)
# Run wilcox test on every pair
w.tst = pairs %>%
map_df(function(lv) {
p.value = wilcox.test(d$value[d$both==lv[1]], d$value[d$both==lv[2]])$p.value
data.frame(levs=paste(lv, collapse=" "), p.value)
})
# Record number of significant p.values. We'll use this later to adjust the top of the
# y-range of the plots
num_signif = sum(w.tst$p.value <= signif.cutoff)
# Plot significance levels only for combinations with p <= signif.cutoff
p = ggplot(d, aes(x=both, y=value)) +
geom_boxplot() +
facet_grid(~Species, scales="free", space="free_x") +
geom_signif(test="wilcox.test", comparisons = pairs[which(w.tst$p.value <= signif.cutoff)],
map_signif_level = F,
vjust=0,
textsize=3,
size=0.5,
step_increase = 0.08) +
theme_bw() +
theme(axis.title=element_blank(),
axis.text.x = element_text(angle=45, hjust=1))
# Return the plot and the number of significant p-values
return(list(num_signif, p))
})
# Get the highest number of significant p-values across all three "facets"
max_signif = max(sapply(plot_list, function(x) x[[1]]))
# Lay out the three plots as facets (one for each Species), but adjust so that y-range is same
# for each facet. Top of y-range is adjusted using max_signif.
grid.arrange(grobs=lapply(plot_list, function(x) x[[2]] +
scale_y_continuous(limits=c(y_rng[1], y_rng[2] + height.factor*max_signif))),
ncol=3, left="Value")
}
Now run the function with four different significance cutoffs:
signif_plot(0.05)
signif_plot(0.01)
signif_plot(0.9)
signif_plot(0.0015)
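The key step in the function is sizing the y-axis from the number of significant comparisons. That step can be sketched on its own in base R as follows; the data frame d is rebuilt here with random values purely for illustration, and count_signif is a hypothetical helper name.

```r
set.seed(2)
data(iris)
# Long data with the combined treatment/variable factor, values replaced by noise
d <- data.frame(
  Species = rep(iris$Species, times = 4),
  both    = paste(rep(c("A", "B"), length.out = nrow(iris)),
                  rep(names(iris)[1:4], each = nrow(iris))),
  value   = rnorm(4 * nrow(iris))
)

# Number of pairwise Wilcoxon tests with p <= cutoff within one panel
count_signif <- function(dd, cutoff = 0.05) {
  pairs <- combn(sort(unique(dd$both)), 2, simplify = FALSE)
  p <- vapply(pairs, function(lv) {
    wilcox.test(dd$value[dd$both == lv[1]], dd$value[dd$both == lv[2]])$p.value
  }, numeric(1))
  sum(p <= cutoff)
}

n_signif <- vapply(split(d, d$Species), count_signif, integer(1))

# Headroom grows with the busiest panel, so all panels share one y-range
height_factor <- 0.23  # same default as height.factor in the function above
y_limits <- c(min(d$value), max(d$value) + height_factor * max(n_signif))
n_signif
y_limits
```

With no significant pairs, max(n_signif) is 0 and the upper limit collapses to the data maximum, which is what removes the empty space.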

You can try the following. Although it is similar to my answer here, I have now added a function.
library(tidyverse)
library(ggsignif)
library(broom) # provides tidy()
# 1. your data
set.seed(2)
df <- as_tibble(iris) %>% # as.tbl() is deprecated in current dplyr
mutate(treatment=rep(c("A","B"), length(iris$Species)/2)) %>%
gather(key, value, -Species, -treatment) %>%
mutate(value=rnorm(n())) %>%
mutate(key=factor(key, levels=unique(key))) %>%
mutate(both=interaction(treatment, key, sep = " "))
# 2. pairwise.wilcox.test for 1) validation and 2) to calculate the ylim
Wilcox <- df %>%
split(., .$Species) %>%
map(~tidy(pairwise.wilcox.test(.$value, .$both, p.adjust.method = "none"))) %>%
map(~filter(.,.$p.value < 0.05)) %>%
bind_rows(.id="Species") %>%
mutate(padjust=p.adjust(p.value, method = "BH"))
# 3. calculate y range
Ylim <- df %>%
summarise(Min=round(min(value)),
Max=round(max(value))) %>%
mutate(Max=Max+0.5*group_by(Wilcox, Species) %>% count() %>% with(.,max(n))) %>%
c()
# 4. the plot function
foo <- function(df, Ylim, Signif=0.05){
P <- df %>%
ggplot(aes(x=both, y=value)) +
geom_boxplot(aes(fill=Species)) +
facet_grid(~Species) +
ylim(Ylim$Min, Ylim$Max)+
theme(axis.text.x = element_text(angle=45, hjust=1)) +
geom_signif(comparisons = combn(levels(df$both),2,simplify = F),
map_signif_level = F, test = "wilcox.test" ) +
stat_summary(fun=mean, geom="point", shape=5, size=4) + # fun.y is deprecated since ggplot2 3.3.0
xlab("")
# 5. remove not significant values and add step increase
P_new <- ggplot_build(P)
P_new$data[[2]] <- P_new$data[[2]] %>%
filter(as.numeric(as.character(annotation)) < Signif) %>% # use the function's cutoff argument
group_by(PANEL) %>%
mutate(index=(as.numeric(group[drop=T])-1)*0.5) %>%
mutate(y=y+index,
yend=yend+index) %>%
select(-index) %>%
as.data.frame()
# the final plot
plot(ggplot_gtable(P_new))
}
foo(df, Ylim)
Trying other data:
set.seed(12345)
df <- as_tibble(iris) %>% # as.tbl() is deprecated in current dplyr
mutate(treatment=rep(c("A","B"), length(iris$Species)/2)) %>%
gather(key, value, -Species, -treatment) %>%
mutate(value=rnorm(n())) %>%
mutate(key=factor(key, levels=unique(key))) %>%
mutate(both=interaction(treatment, key, sep = " "))
foo(df, list(Min=-3,Max=5))
Of course, you can add the Ylim calculation to the function as well. In addition, you can add or change ggtitle() and ylab(), and change the colors.

Predicted vs. Actual plot

I'm new to R and statistics and haven't been able to figure out how one would go about plotting predicted values vs. Actual values after running a multiple linear regression. I have come across similar questions (just haven't been able to understand the code). I would greatly appreciate it if you explain the code.
This is what I have done so far:
# Attach file containing variables and responses
q <- read.csv("C:/Users/A/Documents/Design.csv")
attach(q)
# Run a linear regression
model <- lm(qo~P+P1+P4+I)
# Summary of linear regression results
summary(model)
The plot of predicted vs. actual is so I can graphically see how well my regression fits on my actual data.

It would be better if you provided a reproducible example, but here's an example I made up:
set.seed(101)
dd <- data.frame(x=rnorm(100),y=rnorm(100),
z=rnorm(100))
dd$w <- with(dd,
rnorm(100,mean=x+2*y+z,sd=0.5))
It's (much) better to use the data argument; you should almost never use attach():
m <- lm(w~x+y+z,dd)
plot(predict(m),dd$w,
xlab="predicted",ylab="actual")
abline(a=0,b=1)
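To put a number on what the plot shows, note that for a linear model with an intercept the squared correlation between predicted and actual values equals the model's R-squared. A short check on the same simulated data follows; the names r2_from_plot, r2_from_model and rmse are introduced here for illustration.

```r
set.seed(101)
dd <- data.frame(x = rnorm(100), y = rnorm(100), z = rnorm(100))
dd$w <- with(dd, rnorm(100, mean = x + 2 * y + z, sd = 0.5))
m <- lm(w ~ x + y + z, data = dd)

# Squared correlation of the two plotted quantities
r2_from_plot <- cor(predict(m), dd$w)^2
# R-squared as reported by summary()
r2_from_model <- summary(m)$r.squared
all.equal(r2_from_plot, r2_from_model)

# Typical vertical distance of the points from the 1:1 line
rmse <- sqrt(mean((dd$w - predict(m))^2))
rmse
```

The tighter the points sit around the 1:1 line, the closer R-squared is to 1 and the smaller the RMSE.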

Besides predicted vs actual plot, you can get an additional set of plots which help you to visually assess the goodness of fit.
# --- first run the code from Ben Bolker's answer above ---
par(mfrow = c(2, 2))
plot(m)

A tidy way of doing this would be to use broom::augment():
library(tidyverse)
library(cowplot)
library(broom) # provides augment()
set.seed(101)
# Using Ben's data above:
dd <- data.frame(x=rnorm(100),y=rnorm(100),
z=rnorm(100))
dd$w <- with(dd,rnorm(100,mean=x+2*y+z,sd=0.5))
m <- lm(w~x+y+z,dd)
m %>% augment() %>%
ggplot() +
geom_point(aes(.fitted, w)) +
geom_smooth(aes(.fitted, w), method = "lm", se = FALSE, color = "lightgrey") +
labs(x = "Fitted", y = "Actual") + # .fitted is mapped to x, the observed w to y
theme_bw()
This works especially nicely for deeply nested lists of regressions.
To illustrate this, consider some nested list of regressions:
Reglist <- list()
Reglist$Reg1 <- dd %>% do(reg = lm(as.formula("w~x*y*z"), data = .)) %>% mutate( Name = "Type 1")
Reglist$Reg2 <- dd %>% do(reg = lm(as.formula("w~x+y*z"), data = .)) %>% mutate( Name = "Type 2")
Reglist$Reg3 <- dd %>% do(reg = lm(as.formula("w~x"), data = .)) %>% mutate( Name = "Type 3")
Reglist$Reg4 <- dd %>% do(reg = lm(as.formula("w~x+z"), data = .)) %>% mutate( Name = "Type 4")
Now is where the power of the above tidy plotting framework comes to life...:
Graph_Creator <- function(Reglist){
Reglist %>% pull(reg) %>% .[[1]] %>% augment() %>%
ggplot() +
geom_point(aes(.fitted, w)) +
geom_smooth(aes(.fitted, w), method = "lm", se = FALSE, color = "lightgrey") +
labs(x = "Fitted", y = "Actual",
title = paste0("Regression Type: ", Reglist$Name) ) +
theme_bw()
}
Reglist %>% map(~Graph_Creator(.)) %>%
cowplot::plot_grid(plotlist = ., ncol = 1)

Same as @Ben Bolker's solution, but producing a ggplot object instead of using base R:
# first generate the dd data set using the code in Ben's solution, then...
require(ggpubr)
m <- lm(w~x+y+z,dd)
ggscatter(x = "prediction",
y = "actual",
data = data.frame(prediction = predict(m),
actual = dd$w)) +
geom_abline(intercept = 0,
slope = 1)
