Predicted vs. Actual plot - r

I'm new to R and statistics and haven't been able to figure out how to plot predicted values vs. actual values after running a multiple linear regression. I have come across similar questions but haven't been able to understand the code, so I would greatly appreciate it if you explained the code.
This is what I have done so far:
# Attach file containing variables and responses
q <- read.csv("C:/Users/A/Documents/Design.csv")
attach(q)
# Run a linear regression
model <- lm(qo~P+P1+P4+I)
# Summary of linear regression results
summary(model)
The plot of predicted vs. actual is so I can graphically see how well my regression fits on my actual data.

It would be better if you provided a reproducible example, but here's an example I made up:
set.seed(101)
dd <- data.frame(x=rnorm(100),y=rnorm(100),
z=rnorm(100))
dd$w <- with(dd,
rnorm(100,mean=x+2*y+z,sd=0.5))
It's (much) better to use the data argument; you should almost never use attach():
m <- lm(w~x+y+z,dd)
plot(predict(m),dd$w,
xlab="predicted",ylab="actual")
abline(a=0,b=1)
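To unpack the code above: predict(m), called without new data, returns one fitted value per row used in the fit, so each point pairs a prediction with its observed w, and abline(a=0,b=1) draws the line on which prediction and observation would be equal. The following is a minimal sketch, assuming the simulated dd from this answer, that checks two consequences numerically: the predictions match fitted(m), and the squared correlation between predicted and actual values equals the model's R-squared (which holds for a linear model with an intercept).

```r
# Recreate the simulated data and model from the answer above
set.seed(101)
dd <- data.frame(x = rnorm(100), y = rnorm(100), z = rnorm(100))
dd$w <- with(dd, rnorm(100, mean = x + 2 * y + z, sd = 0.5))
m <- lm(w ~ x + y + z, data = dd)

# predict() without newdata gives the in-sample fitted values
pred <- predict(m)
same_as_fitted <- isTRUE(all.equal(pred, fitted(m)))

# The tighter the cloud around the 45-degree line, the higher R-squared
r2_from_plot  <- cor(pred, dd$w)^2
r2_from_model <- summary(m)$r.squared
print(c(from_plot = r2_from_plot, from_model = r2_from_model))
```

The same pattern should carry over to the model in the question by passing the data frame explicitly, for example lm(qo ~ P + P1 + P4 + I, data = q), and plotting predict(model) against q$qo.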

Besides the predicted vs. actual plot, you can get an additional set of diagnostic plots that help you visually assess the goodness of fit.
# First run the code from Ben Bolker's answer above
par(mfrow = c(2, 2))
plot(m)

A tidy way of doing this would be to use broom::augment():
library(tidyverse)
library(cowplot)
library(broom)
set.seed(101)
# Using Ben's data above:
dd <- data.frame(x=rnorm(100),y=rnorm(100),
z=rnorm(100))
dd$w <- with(dd,rnorm(100,mean=x+2*y+z,sd=0.5))
m <- lm(w~x+y+z,dd)
m %>% augment() %>%
ggplot() +
geom_point(aes(.fitted, w)) +
geom_smooth(aes(.fitted, w), method = "lm", se = FALSE, color = "lightgrey") +
labs(x = "Fitted", y = "Actual") +
theme_bw()
This works especially well for deeply nested lists of regressions.
To illustrate this, consider some nested list of regressions:
Reglist <- list()
Reglist$Reg1 <- dd %>% do(reg = lm(as.formula("w~x*y*z"), data = .)) %>% mutate( Name = "Type 1")
Reglist$Reg2 <- dd %>% do(reg = lm(as.formula("w~x+y*z"), data = .)) %>% mutate( Name = "Type 2")
Reglist$Reg3 <- dd %>% do(reg = lm(as.formula("w~x"), data = .)) %>% mutate( Name = "Type 3")
Reglist$Reg4 <- dd %>% do(reg = lm(as.formula("w~x+z"), data = .)) %>% mutate( Name = "Type 4")
Now is where the power of the above tidy plotting framework comes to life...:
Graph_Creator <- function(Reglist){
Reglist %>% pull(reg) %>% .[[1]] %>% augment() %>%
ggplot() +
geom_point(aes(.fitted, w)) +
geom_smooth(aes(.fitted, w), method = "lm", se = FALSE, color = "lightgrey") +
labs(x = "Fitted", y = "Actual",
title = paste0("Regression Type: ", Reglist$Name) ) +
theme_bw()
}
Reglist %>% map(~Graph_Creator(.)) %>%
cowplot::plot_grid(plotlist = ., ncol = 1)

Same as @Ben Bolker's solution, but producing a ggplot object instead of using base R:
# First generate the dd data set using the code in Ben's solution, then...
library(ggpubr)
m <- lm(w~x+y+z,dd)
ggscatter(x = "prediction",
y = "actual",
data = data.frame(prediction = predict(m),
actual = dd$w)) +
geom_abline(intercept = 0,
slope = 1)

Related

Equal sign changes rendering of legend labels in autoplot of a survfit object

I am using the survival package to make Kaplan-Meier estimates of survival curves by group, and then I plot the curves using the packages ggfortify and survminer. Everything works fine except the legend labels in the plot. I want to show the group sizes (N) in the legend labels. I thought that adding N to the grouping variable itself using paste0 was a good way to go. In my case it is easier than using something like scale_fill_discrete("", labels = legend_labeller_for_plot).
library(dplyr)
library(ggplot2)
library(survival)
library(survminer)
library(ggfortify)
set.seed(100)
data <- data.frame(
time = rlnorm(20),
event = as.integer(runif(20) < 0.5),
group = ifelse(runif(20) > 0.5,
"group A",
"group B")
)
# Plotting survival curves without N sizes in the legend
fit <- survfit(
with(data, Surv(time, event)) ~ group,
data)
autoplot(fit)
# Adding N sizes to the data and plotting
data_new <- data %>%
group_by(group) %>% mutate(N = n()) %>%
ungroup() %>%
mutate(group_with_N = paste0(group, ", N = ", N))
fit_new <- survfit(
with(data, Surv(time, event)) ~ group_with_N,
data_new)
autoplot(fit_new)
When I add N sizes to the grouping variable, the part with "N =" disappears, i.e. the grouping variable isn't displayed in the legend labels as expected.
For comparison, what I expect is something like the following using Iris data:
What is more, I found that the culprit is the equal sign =. When I remove the = sign, the legend labels correspond to the grouping variable values.
My question is, why does the equal sign cause this?
One option is to use ggsurvplot, where you can specify legend.labs, so you can show the group sizes in the legend like this:
library(dplyr)
library(ggplot2)
library(survival)
library(survminer)
library(ggfortify)
set.seed(100)
data <- data.frame(
time = rlnorm(20),
event = as.integer(runif(20) < 0.5),
group = ifelse(runif(20) > 0.5,
"group A",
"group B")
)
# Adding N sizes to the data and plotting
data_new <- data %>%
group_by(group) %>% mutate(N = n()) %>%
ungroup() %>%
mutate(group_with_N = paste0(group, ", N = ", N))
fit_new <- survfit(
with(data, Surv(time, event)) ~ group_with_N,
data_new)
p <- autoplot(fit_new)
p
# ggsurvplot
ggsurvplot(fit_new, data_new,
legend.labs = unique(sort(data_new$group_with_N)),
conf.int = TRUE)
Created on 2022-08-18 with reprex v2.0.2
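As background to the "why" part of the question, the sketch below shows how survfit itself names the strata: each name has the form variable=level, so "=" already acts as a separator before any plotting code sees the labels. The two sub() calls are hypothetical relabelling rules written only for illustration; they are not taken from ggfortify's source, and whether autoplot does something similar is an assumption that this thread does not confirm. The group labels are hard-coded here so that the example is deterministic.

```r
library(survival)

set.seed(100)
dat <- data.frame(
  time  = rlnorm(20),
  event = as.integer(runif(20) < 0.5),
  # Labels that contain an equal sign, as in the question
  group_with_N = rep(c("group A, N = 10", "group B, N = 10"), each = 10)
)

fit <- survfit(Surv(time, event) ~ group_with_N, data = dat)

# survfit names each stratum "<variable>=<level>"
strata_names <- names(fit$strata)
print(strata_names)

# Hypothetical rules for dropping the "<variable>=" prefix:
labels_first_eq <- sub("^[^=]*=", "", strata_names)  # stops at the first "="
labels_greedy   <- sub("^.*=", "", strata_names)     # runs to the last "="
print(labels_first_eq)
print(labels_greedy)
```

A rule that removes everything up to the last "=" would also remove the "group A, N" part of the label, which is the kind of truncation described in the question. Passing explicit labels, as legend.labs does in the answer above, sidesteps the issue regardless of the cause.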

How can categorical data be clustered in 3 dimensions?

I have data obtained from a survey and I would like to analyze it, form clusters and display them in 3D, as that makes the information more tangible.
There are many columns of questions, which the respondents answer with Agree (1), Somewhat agree (0.8), Neutral (0.6), Somewhat disagree (0.4) or Disagree (0.2), plus a final numerical rating question, which means the data are essentially categorical.
A sample of the dataset is shown below:
q1,q2,q3,q4,q5,q6,q7
1,0.8,0.6,0.2,0.2,0.4,10
0.2,1,0.4,0.4,0.4,0.4,9
0.6,1,0.2,0.4,0.2,0.2,6
I am trying to write some code in R based on the following reference: https://plotly.com/r/t-sne-and-umap-projections/
And the code I've tried to run is the following:
library(cluster)
library(dplyr) # provides %>%, mutate(), group_by() and do() used below
gower_df <- daisy(data,
metric = "gower" ,
type = list(logratio = 2))
silhouette <- c()
silhouette = c(silhouette, NA)
for(i in 2:10){
pam_clusters = pam(as.matrix(gower_df),
diss = TRUE,
k = i)
silhouette = c(silhouette ,pam_clusters$silinfo$avg.width)
}
plot(1:10, silhouette,
xlab = "Clusters",
ylab = "Silhouette Width")
lines(1:10, silhouette)
pam_ = pam(gower_df, diss = TRUE, k = 2)
data[pam_$medoids, ]
pam_summary <- data %>%
mutate(cluster = pam_$clustering) %>%
group_by(cluster) %>%
do(cluster_summary = summary(.))
pam_summary$cluster_summary[[1]]
library(Rtsne)
library(ggplot2)
tsne_object <- Rtsne(gower_df, is_distance = TRUE)
tsne_df <- tsne_object$Y %>%
data.frame() %>%
setNames(c("X", "Y")) %>%
mutate(cluster = factor(pam_$clustering))
ggplot(aes(x = X, y = Y), data = tsne_df) +
geom_point(aes(color = cluster))
library(plotly)
library(umap)
fig2 <- plot_ly(tsne_df)
fig2
But I get a 2D representation. Any idea how I can do it?
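One possible way to get a third dimension, sketched here under stated assumptions, is to ask Rtsne for a 3-dimensional embedding via its dims argument and then pass all three coordinates to plot_ly with type = "scatter3d". The survey data frame below is simulated stand-in data (three Likert-style columns and one rating column), and perplexity = 10 is an arbitrary choice that suits the 60 simulated rows; both would need adjusting for the real survey.

```r
library(cluster)
library(Rtsne)
library(plotly)

# Simulated stand-in for the survey (an assumption, not the real data)
set.seed(42)
likert <- c(0.2, 0.4, 0.6, 0.8, 1)
survey <- data.frame(
  q1 = sample(likert, 60, replace = TRUE),
  q2 = sample(likert, 60, replace = TRUE),
  q3 = sample(likert, 60, replace = TRUE),
  q7 = sample(1:10, 60, replace = TRUE)
)

# Gower dissimilarities and a PAM clustering, as in the question
gower_dist <- daisy(survey, metric = "gower")
pam_fit <- pam(gower_dist, diss = TRUE, k = 2)

# dims = 3 requests a three-dimensional embedding instead of the default 2
tsne_3d <- Rtsne(gower_dist, dims = 3, is_distance = TRUE, perplexity = 10)

tsne_df <- data.frame(tsne_3d$Y)
names(tsne_df) <- c("X", "Y", "Z")
tsne_df$cluster <- factor(pam_fit$clustering)

# Map all three coordinates explicitly; scatter3d makes the plot 3D
fig <- plot_ly(tsne_df, x = ~X, y = ~Y, z = ~Z, color = ~cluster,
               type = "scatter3d", mode = "markers")
fig
```

The original plot_ly(tsne_df) call maps no z coordinate, and the embedding itself only had two columns, so both parts need to change to obtain a 3D figure.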

How to specify groups with colors in qqplot()?

I have created a qqplot (with quantiles of beta distribution) from a dataset including two groups. To visualize, which points belong to which group, I would like to color them. I have tried the following:
res <- beta.mle(data$values) #estimate parameters of beta distribution
qqplot(qbeta(ppoints(500),res$param[1], res$param[2]),data$values,
col = data$group,
ylab = "Quantiles of data",
xlab = "Quantiles of Beta Distribution")
the result is shown here:
I have seen solutions specifying a "col" vector for qqnorm; however, this does not seem to work with qqplot, as half of the points are simply drawn in each color, regardless of group. Is there a way to fix this?
I simulated some data just to show how to add color in ggplot.
Libraries
library(tidyverse)
# install.packages("Rfast")
Data
#Simulating data from beta distribution
x <- rbeta(n = 1000,shape1 = .5,shape2 = .5)
#Estimating parameters
res <- Rfast::beta.mle(x)
data <-
tibble(
simulated_data = sort(x),
quantile_data = qbeta(ppoints(length(x)),res$param[1], res$param[2])
) %>%
#Creating a group variable using quartiles
mutate(group = cut(x = simulated_data,
quantile(simulated_data,seq(0,1,.25)),
include.lowest = TRUE))
Code
data %>%
# Adding group variable as color
ggplot(aes( x = quantile_data, y = simulated_data, col = group))+
geom_point()
Output
For those who are wondering, how to work with pre-defined groups, this is the code that worked for me:
library(tidyverse)
library(Rfast)
res <- beta.mle(x)
# make sure groups are not numerical
# (else the color scale might turn out continuous)
g <- plyr::mapvalues(g, c("1", "2"), c("Group1", "Group2"))
data <-
tibble(
my_data = sort(x),
quantile_data = qbeta(ppoints(length(x)),res$param[1], res$param[2]),
group = g[order(x)]
)
data %>%
# Adding group variable as color
ggplot(aes( x = quantile_data, y = my_data, col = group))+
geom_point()
result
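The key step in both answers is that the data values are sorted before plotting, so the group labels have to be reordered in the same way (group = g[order(x)]); passing the unsorted group vector as col to qqplot pairs colors with the wrong points. A minimal base R sketch of the same idea follows. The data and groups are simulated, and the beta shape parameters are fixed at the values used for simulation rather than estimated with beta.mle, purely to keep the example free of extra packages.

```r
set.seed(1)
n <- 200
shape1 <- 2  # assumed known here; the answers estimate these with beta.mle
shape2 <- 5
x <- rbeta(n, shape1, shape2)
g <- factor(sample(c("Group1", "Group2"), n, replace = TRUE))

# Sort the data and carry the group labels along with the same ordering
ord <- order(x)
qq <- data.frame(
  theoretical = qbeta(ppoints(n), shape1, shape2),
  observed    = x[ord],
  group       = g[ord]
)

# A factor passed to col is used via its integer codes (palette colors 1, 2)
plot(qq$theoretical, qq$observed, col = qq$group, pch = 19,
     xlab = "Quantiles of Beta Distribution", ylab = "Quantiles of data")
abline(0, 1)
legend("topleft", legend = levels(qq$group),
       col = seq_along(levels(qq$group)), pch = 19)
```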

Number of displayed decimals in annotation_custom(tableGrob) in R

I want to add a summary table to a plot made with ggplot. I am using annotation_custom to add a previously created table.
My problem is that the values in the table are shown with differing numbers of decimals.
As an example I am using the mtcars dataset, and my code is the following:
rm(list=ls()) #Clear environment console
data(mtcars)
head(mtcars)
library(dplyr)
library(tidyr)
library(ggplot2)
library(gridExtra)
table <- mtcars %>% #summary table that needs to be overlaid on the plot
select(wt) %>%
summarise(
Obs = length(mtcars$wt),
Q05 = quantile(mtcars$wt, prob = 0.05),
Mean = mean(mtcars$wt),
Med = median(mtcars$wt),
Q95 = quantile(mtcars$wt, prob = 0.95),
SD = sd(mtcars$wt))
dens <- ggplot(mtcars) + #Create example density plot for wt variable
geom_density(aes(x = wt))+ #refer to the column by name inside aes()
labs(title = "Density plot")
plot(dens)
dens1 <- dens + #Overlaying summary table on density plot
annotation_custom(tableGrob(t(table),
cols = c("WT"),
rows=c("Obs", "Q-05", "Mean", "Median", "Q-95", "S.D." ),
theme = ttheme_default(base_size = 11)),
xmin=4.5, xmax=5, ymin=0.2, ymax=0.5)
print(dens1)
Running the code above produces a density plot with the summary table overlaid.
I would like to fix the number of displayed decimals to only 2.
I already tried adding sprintf
annotation_custom(tableGrob(t(sprintf("%0.2f",table)),
But obtained the following error "Error in sprintf("%0.2f", table_pet) :
(list) object cannot be coerced to type 'double'"
I have been looking without any luck. Any idea how I can do this?
Thank you in advance
grid.table leaves the formatting up to you, so convert the numeric columns to formatted strings before drawing the table:
d = data.frame(x = "pi", y = pi)
d2 = d %>% mutate_if(is.numeric, ~sprintf("%.3f",.))
grid.table(d2)
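Applied to the question's mtcars example, the same idea might look like the sketch below: format the summary columns as strings first, then hand the result to tableGrob. It uses across(where(is.double), ...) rather than mutate_if so that the integer observation count is left without decimals; that choice, and the names tab, tab_fmt and tbl_grob, are illustrative rather than part of the original answer.

```r
library(dplyr)
library(ggplot2)
library(gridExtra)

# Summary statistics for wt, one row with one column per statistic
tab <- mtcars %>%
  summarise(
    Obs  = n(),
    Q05  = quantile(wt, probs = 0.05),
    Mean = mean(wt),
    Med  = median(wt),
    Q95  = quantile(wt, probs = 0.95),
    SD   = sd(wt)
  )

# Format only the double columns to two decimals; Obs stays an integer
tab_fmt <- tab %>%
  mutate(across(where(is.double), ~ sprintf("%.2f", .x)))

tbl_grob <- tableGrob(
  t(tab_fmt),
  cols = "WT",
  rows = c("Obs", "Q-05", "Mean", "Median", "Q-95", "S.D."),
  theme = ttheme_default(base_size = 11)
)

dens1 <- ggplot(mtcars, aes(x = wt)) +
  geom_density() +
  labs(title = "Density plot") +
  annotation_custom(tbl_grob, xmin = 4.5, xmax = 5, ymin = 0.2, ymax = 0.5)
print(dens1)
```

The sprintf attempt in the question fails because it is applied to the whole data frame, which is a list; applying it column by column avoids the coercion error.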

Visualize test and training set distribution with ggplot2

I am trying to visualize the distribution of a dataset and its splits into test and training data to check whether the split is stratified.
The minimal example uses the iris dataset. It has a species column which is a factor with 3 levels. The following code snippet will show a nice plot with the count for each label, however I would like to see the percentage/probability for the labels in the respective set to see the distribution of the training and test sets.
library("tidyverse")
data(iris)
n = nrow(iris)
idxTrain <- sample(1:n, size = round(0.7*n), replace = F)
train <- iris[idxTrain,]
test <- iris[-idxTrain,]
iris$Set <- rep("Train", times = nrow(iris))
iris$Set[-idxTrain] <- "Test"
ggplot(iris, aes(x = Species, fill = Set)) + geom_bar(position = "dodge")
I tried calculating the percentage as shown below, but this does not work: it shows each bar's share of the whole data frame, which gives a distribution similar to the counts.
geom_bar(aes(y = (..count..)/sum(..count..)))
How can I plot the percentage of each label within each set efficiently?
Bonus: Including the whole dataset, train and test.
library("tidyverse")
data(iris)
n = nrow(iris)
idxTrain <- sample(1:n, size = round(0.7*n), replace = F)
train <- iris[idxTrain,]
test <- iris[-idxTrain,]
iris$Set <- rep("Train", times = nrow(iris))
iris$Set[-idxTrain] <- "Test"
You need a separate data frame for the labels:
df_labs <-
iris %>%
group_by(Species) %>%
count(Set) %>%
mutate(pct = n / sum(n)) %>%
filter(Set == "Test")
which you then use as the data for the label (or text) geom:
ggplot(iris, aes(x = Species, fill = Set)) +
geom_bar(position = "dodge") +
geom_label(data = df_labs, aes(label = scales::percent(pct), y = n / 2))
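If the goal is the share of each species within each set (plus the whole dataset, as in the bonus), one alternative is to compute the proportions up front and plot them with geom_col. This is a sketch of that approach, not the method from the answer above; the names iris_split, all_rows and props are made up for the example, and the seed is fixed only to make the split reproducible.

```r
library(dplyr)
library(ggplot2)

set.seed(1)
n <- nrow(iris)
idx_train <- sample(seq_len(n), size = round(0.7 * n))

# Label every row as Train or Test
iris_split <- iris
iris_split$Set <- "Test"
iris_split$Set[idx_train] <- "Train"

# Stack a copy of the full data labelled "All" to cover the bonus
all_rows <- mutate(iris_split, Set = "All")

# Proportion of each species within each set
props <- bind_rows(iris_split, all_rows) %>%
  count(Set, Species) %>%
  group_by(Set) %>%
  mutate(pct = n / sum(n)) %>%
  ungroup()

p <- ggplot(props, aes(x = Species, y = pct, fill = Set)) +
  geom_col(position = "dodge") +
  scale_y_continuous(labels = scales::percent) +
  labs(y = "Share within set")
print(p)
```

With a well-stratified split, the three bars for each species should have similar heights, each close to the one-third share seen in the full dataset.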

Resources