Plotting GLM models in ggplot2 in R

Apologies for the obvious question, but just in case there is a simple answer! Here is an example of what my data looks like:
DATA <- data.frame(
TotalAbund = sample(1:10),
TotalHab = sample(0:1),
TotalInv = sample(c("yes", "no"), 20, replace = TRUE)
)
DATA$TotalHab<-as.factor(DATA$TotalHab)
DATA
I've made the following plot:
p <- ggplot(DATA, aes(x=factor(TotalInv), y=TotalAbund,colour=TotalHab))
p + geom_boxplot() + geom_jitter()
I've created a model as follows:
MOD.1<-glm(TotalAbund~TotalInv+TotalHab, data=DATA)
However, I want to present the fitted values from the glm model rather than the raw data. I know I can simply do it in visreg with:
visreg(MOD.1)
Is there a way to do this with ggplot too? Thanks

You could do something like this:
Create a "prediction frame" containing the relevant values for which you want to predict (if you had a continuous predictor, it would probably make more sense to include evenly spaced values, e.g. seq(min(cont_pred),max(cont_pred),length=51))
pframe <- with(DATA,
expand.grid(TotalInv=unique(TotalInv),
TotalHab=unique(TotalHab)))
Use the predict method to fill in the predicted values:
pframe$TotalAbund <- predict(MOD.1,newdata=pframe)
Add a layer to the graph. The only annoying part is using position_dodge with a manually tweaked width to match the widths of the bars ... (I'm assuming here that you've saved your existing plot as gg1 ...)
gg1 + geom_point(data=pframe,size=8,shape=16,alpha=0.7,
position=position_dodge(width=0.75))
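Put together, the steps above can be sketched end to end as follows. This is only a minimal sketch: the data are a balanced stand-in for the question's DATA (an assumption, so that every TotalInv/TotalHab combination occurs), and the names gg1 and gg2 are illustrative.

```r
library(ggplot2)

set.seed(1)
# Stand-in data (assumption): balanced design so all four combinations occur
DATA <- data.frame(
  TotalAbund = sample(1:10, 20, replace = TRUE),
  TotalHab   = factor(rep(c(0, 0, 1, 1), 5)),
  TotalInv   = rep(c("yes", "no"), 10)
)

# Gaussian GLM, as in the question
MOD.1 <- glm(TotalAbund ~ TotalInv + TotalHab, data = DATA)

# Prediction frame: one row per combination of the two predictors
pframe <- expand.grid(TotalInv = unique(DATA$TotalInv),
                      TotalHab = unique(DATA$TotalHab))
pframe$TotalAbund <- predict(MOD.1, newdata = pframe)

# Raw-data plot, saved so that a layer can be added to it
gg1 <- ggplot(DATA, aes(x = TotalInv, y = TotalAbund, colour = TotalHab)) +
  geom_boxplot() +
  geom_jitter()

# Overlay the fitted values, dodged to line up with the boxplots
gg2 <- gg1 +
  geom_point(data = pframe, size = 8, shape = 16, alpha = 0.7,
             position = position_dodge(width = 0.75))
```

Printing gg2 draws the plot. The dodge width of 0.75 is the manually tweaked value from the answer and may need adjusting if the boxplot widths change.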

Related

Plotting Chi-square Distribution with ggplot2 in R

I would like to use R to randomly draw 100 observations from a chi-square distribution with 5 degrees of freedom. After doing so, I want to calculate the mean of those observations and use ggplot2 to plot the distribution as a bar chart. The following is my code:
rm(list = ls())
library(ggplot2)
set.seed(9487)
###Step_1###
x_100 <-data.frame(rchisq(100, 5, ncp = FALSE))
###Step_2###
mean_x <- mean(x_100[,1])
class(x_100)
###Step_3###
plot_x_100 <- ggplot(data = x_100, aes(x = x_100)) +
geom_bar()
plot_x_100
First, I construct a data frame of random chi-square draws with df = 5 and 100 observations.
Second, I calculate the mean of these values.
Finally, I plot the graph with the ggplot2 package.
However, I get the following result:
Don't know how to automatically pick scale for object of type data.frame. Defaulting to continuous.
Error in is.finite(x) : default method not implemented for type 'list'
I have been stuck on this problem for several hours and cannot find any list in my global environment. I would appreciate any help or suggestions.
The problem is that inside the ggplot() call you pass the same data frame (x_100) both as the data and as the x variable inside aes(). In ggplot2, aes() should name the column you want to map, not the data frame itself. Additionally, if you want to plot the chi-square distribution, geom_histogram is probably a better choice than geom_bar, because geom_histogram groups the observations into bins.
library(ggplot2)
# Rename the only column of your data frame as "value"
colnames(x_100) <- "value"
plot_x_100 <- ggplot(data = x_100, aes(x = value)) +
geom_histogram(bins = 20)
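As a self-contained variant, the column can be named when the data frame is created, and the mean from Step 2 of the question can be drawn on the histogram. This is a minimal sketch; the dashed line via geom_vline is an addition for illustration and not part of the original answer.

```r
library(ggplot2)

set.seed(9487)
# Name the column up front so aes() can refer to it
x_100 <- data.frame(value = rchisq(100, df = 5))

# Mean of the simulated observations
mean_x <- mean(x_100$value)

# Histogram of the draws, with the sample mean as a dashed vertical line
plot_x_100 <- ggplot(data = x_100, aes(x = value)) +
  geom_histogram(bins = 20) +
  geom_vline(xintercept = mean_x, linetype = 2)
```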

plotting log(10) lengths differ

I am having difficulty plotting a log10 formula onto existing data points. I derived a logarithmic function based on a list of data where "Tout_F_6am" is my independent variable and "clo" is my dependent variable.
When I go to plot it, I get an error saying that the lengths of x and y differ. Can someone please help me figure out what's going wrong?
logKT=lm(log10(clo)~ Tout_F_6am,data=passive)
summary(logKT) #r2=0.12
coef(logKT)
plot(passive$Tout_F_6am,passive$clo) #plot data points
x=seq(53,84, length=6381)#match length of x variable
y=logKT
lines(x,y,type="l",lwd=2,col="red")
length(passive$Tout_F_6am) #6381
length(passive$clo) #6381
Additionally, can the formula curve(-0.0219-0.005*log10(x),add=TRUE,col=2) be written as eq=(10^-0.022)*(10^-0.005*x)? Thanks!
The problem is that you are trying to plot the model object, not the predictions from the model. Try something like this:
Define the explanatory values you want to plot, in a data frame (or tibble). It doesn't have to be as many as there are data points.
library(dplyr)
explanatory_data <- tibble(
Tout_F_6am = seq(53, 84, 0.1)
)
Add a column of predicted values using predict(). This takes a model and your explanatory data. Because the model's response is log10(clo), predict() returns values on the log10 scale, so you have to back-transform them.
prediction_data <- explanatory_data %>%
mutate(
log10_clo = predict(logKT, explanatory_data),
clo = 10 ^ log10_clo
)
Finally, draw your plot.
plot(clo ~ Tout_F_6am, data = prediction_data, log="y", type = "l")
The plotting is actually easier using ggplot2. This should give you more or less what you want.
library(ggplot2)
ggplot(passive, aes(Tout_F_6am, clo)) +
geom_point() +
geom_smooth(method = "lm") +
scale_y_log10()
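To draw the fitted curve on top of the existing base-graphics scatterplot, as the question tried to do with lines(), something like the sketch below should work. The passive data frame here is simulated purely for illustration (an assumption; the real data are not shown), and newdat is a hypothetical name. Since the model is log10(clo) = a + b * Tout_F_6am, the back-transformed prediction 10^(a + b * x) is the same as 10^a * 10^(b * x).

```r
set.seed(42)
# Simulated stand-in for the question's data (assumption)
passive <- data.frame(Tout_F_6am = runif(200, 53, 84))
passive$clo <- 10^(-0.02 - 0.005 * passive$Tout_F_6am + rnorm(200, sd = 0.1))

# Same model as in the question
logKT <- lm(log10(clo) ~ Tout_F_6am, data = passive)

# Predict on a grid of x values and back-transform from the log10 scale
newdat <- data.frame(Tout_F_6am = seq(53, 84, by = 0.1))
newdat$clo <- 10^predict(logKT, newdata = newdat)

# Equivalent closed form built from the coefficients: 10^a * 10^(b * x)
a <- coef(logKT)[["(Intercept)"]]
b <- coef(logKT)[["Tout_F_6am"]]
closed_form <- 10^a * 10^(b * newdat$Tout_F_6am)

# Scatterplot of the data with the fitted curve added
plot(passive$Tout_F_6am, passive$clo)
lines(newdat$Tout_F_6am, newdat$clo, lwd = 2, col = "red")
```

The x and y vectors passed to lines() come from the same data frame, so they have the same length by construction, which avoids the error from the question.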

Change colors of select lines in ggplot2 coefficient plot in R

I would like to change the color of coefficient lines based on whether the point estimate is negative or positive in a ggplot2 coefficient plot in R. For example:
require(coefplot)
set.seed(123)
dat <- data.frame(y1 = rnorm(100), x = rnorm(100), z = rnorm(100))
mod1 <- lm(y1 ~ x + z, data = dat)
coefplot.lm(mod1)
Which produces the following plot:
In this plot, I would like to change the "x" variable to red when plotted. Any ideas? Thanks.
I don't think you can do this with a plot produced by coefplot.lm. The coefplot package uses ggplot2 as its plotting system, which is fine in itself, but it does not let you control colors as easily as you would like. To get the desired colors, you need a variable in your data set that color-codes the values, and you need to map it to color in aes() within the layer that draws the dots for the coefficient estimates. Apparently this is not possible with the output of the coefplot.lm function. You might be able to change the colors using ggplot2's ggplot_build() function, but I would say it is easier to write your own function for this task.
I've done this once to plot odds ratios. If you want, you may use my code. Feel free to change it. The idea is the same as in coefplot: first, we extract the coefficients from a model object and prepare the data set for plotting; second, we draw the plot.
The code for extracting coefficients and data set preparation
df_plot_odds <- function(x){
  tmp <- data.frame(cbind(exp(coef(x)), exp(confint.default(x))))
  odds <- tmp[-1,]
  names(odds) <- c('OR', 'lower', 'upper')
  odds$vars <- row.names(odds)
  odds$col <- odds$OR > 1
  odds$col[odds$col==TRUE] <- 'blue'
  odds$col[odds$col==FALSE] <- 'red'
  odds$pvalue <- summary(x)$coef[-1, "Pr(>|t|)"]
  return(odds)
}
Plot the output of the extract function
plot_odds <- function(df_plot_odds, xlab="Odds Ratio", ylab="", asp=1){
  require(ggplot2)
  p <- ggplot(df_plot_odds, aes(x=vars, y=OR, ymin=lower, ymax=upper), asp=asp) +
    geom_errorbar(aes(color=col), width=0.1) +
    geom_point(aes(color=col), size=3) +
    geom_hline(yintercept = 1, linetype=2) +
    scale_color_manual('Effect', labels=c('Positive','Negative'),
                       values=c('blue','red')) +
    coord_flip() +
    theme_bw() +
    theme(legend.position="none", aspect.ratio = asp) +
    ylab(xlab) +
    xlab(ylab) # switch because of the coord_flip() above
  return(p)
}
Plotting your example
set.seed(123)
dat <- data.frame(x = rnorm(100),y = rnorm(100), z = rnorm(100))
mod1 <- lm(y ~ x + z, data = dat)
df <- df_plot_odds(mod1)
plot <- plot_odds(df)
plot
Which yields
Note that I chose theme_bw() as the default. The output is a ggplot2 object, so you can change it quite a lot.
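If you want the raw coefficients rather than odds ratios, colored by sign as the question asks, the same idea can be sketched more compactly. This is only an illustration under assumptions: the data frame coefs and its column names are hypothetical, and confint() is used for the intervals, which is a different choice from confint.default() in the function above.

```r
library(ggplot2)

set.seed(123)
dat <- data.frame(x = rnorm(100), y = rnorm(100), z = rnorm(100))
mod1 <- lm(y ~ x + z, data = dat)

# Collect estimates and confidence limits in one data frame
ci <- confint(mod1)
coefs <- data.frame(
  term     = names(coef(mod1)),
  estimate = unname(coef(mod1)),
  lower    = ci[, 1],
  upper    = ci[, 2]
)

# The variable that color-codes each coefficient by its sign
coefs$sign <- ifelse(coefs$estimate < 0, "Negative", "Positive")

p <- ggplot(coefs, aes(y = term, colour = sign)) +
  geom_vline(xintercept = 0, linetype = 2) +
  geom_segment(aes(x = lower, xend = upper, yend = term)) +
  geom_point(aes(x = estimate), size = 3) +
  scale_colour_manual(values = c(Negative = "red", Positive = "blue")) +
  theme_bw()
```

Naming the entries of values ties each color to a sign label, so the mapping stays correct even when only one sign is present in the model.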

Correlation matrix plot with ggplot2

I want to create a correlation matrix plot, i.e. a plot where each variable is plotted in a scatterplot against each other variable like with pairs() or splom(). I want to do this with ggplot2. See here for examples. The link mentions some code someone wrote for doing this in ggplot2, however, it is outdated and no longer works (even after you swap out the deprecated parts).
One could do this with a loop in a loop and then multiplot(), but there must be a better way. I tried melting the dataset to long, and copying the value and variable variables and then using facets. This almost gives you something correct.
d = data.frame(x1=rnorm(100),
x2=rnorm(100),
x3=rnorm(100),
x4=rnorm(100),
x5=rnorm(100))
library(reshape2)
d = melt(d)
d$value2 = d$value
d$variable2 = d$variable
library(ggplot2)
ggplot(data=d, aes(x=value, y=value2)) +
geom_point() +
facet_grid(variable ~ variable2)
This gets the general structure right, but it only plots each variable against itself. Is there a cleverer way of doing this without resorting to two loops?
library(GGally)
set.seed(42)
d = data.frame(x1=rnorm(100),
x2=rnorm(100),
x3=rnorm(100),
x4=rnorm(100),
x5=rnorm(100))
# estimated density in diagonal
ggpairs(d)
# blank
ggpairs(d, diag = list("continuous"="blank"))
Using the PerformanceAnalytics library:
library("PerformanceAnalytics")
chart.Correlation(d, histogram = TRUE, pch = 19)
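For completeness, the facet-based attempt from the question can be made to work with ggplot2 alone. In the question, variable2 is a copy of variable, so only the diagonal panels ever receive data. A minimal sketch of one way around that is to build every (x variable, y variable) pair explicitly; the names pairs_grid and long are hypothetical, and three variables are used to keep the example small.

```r
library(ggplot2)

set.seed(42)
d <- data.frame(x1 = rnorm(100),
                x2 = rnorm(100),
                x3 = rnorm(100))

# All ordered pairs of variable names, including each variable with itself
vars <- names(d)
pairs_grid <- expand.grid(xvar = vars, yvar = vars, stringsAsFactors = FALSE)

# One block of rows per pair: x holds the column xvar, y holds the column yvar
long <- do.call(rbind, lapply(seq_len(nrow(pairs_grid)), function(i) {
  data.frame(xvar = pairs_grid$xvar[i],
             yvar = pairs_grid$yvar[i],
             x = d[[pairs_grid$xvar[i]]],
             y = d[[pairs_grid$yvar[i]]])
}))

# One scatterplot panel per pair of variables
p <- ggplot(long, aes(x = x, y = y)) +
  geom_point(size = 0.5) +
  facet_grid(yvar ~ xvar, scales = "free")
```

This replicates the data once per pair of variables, so for wide data sets ggpairs() is likely the more practical choice.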

Density plots with multiple groups

I am trying to produce something similar to densityplot() from the lattice package, using ggplot2 after using multiple imputation with the mice package. Here is a reproducible example:
require(mice)
dt <- nhanes
impute <- mice(dt, seed = 23109)
x11()
densityplot(impute)
Which produces:
I would like to have some more control over the output (and I am also using this as a learning exercise for ggplot). So, for the bmi variable, I tried this:
bar <- NULL
for (i in 1:impute$m) {
foo <- complete(impute,i)
foo$imp <- rep(i,nrow(foo))
foo$col <- rep("#000000",nrow(foo))
bar <- rbind(bar,foo)
}
imp <-rep(0,nrow(impute$data))
col <- rep("#D55E00", nrow(impute$data))
bar <- rbind(bar,cbind(impute$data,imp,col))
bar$imp <- as.factor(bar$imp)
x11()
ggplot(bar, aes(x=bmi, group=imp, colour=col)) + geom_density()
+ scale_fill_manual(labels=c("Observed", "Imputed"))
which produces this:
So there are several problems with it:
1. The colours are wrong. It seems my attempt to control the colours is completely wrong/ignored
2. There are unwanted horizontal and vertical lines
3. I would like the legend to show Imputed and Observed but my code gives the error invalid argument to unary operator
Moreover, it seems like quite a lot of work to do what is accomplished in one line with densityplot(impute) - so I wondered if I might be going about this in the wrong way entirely ?
Edit: I should add a fourth problem, as noted by @ROLO:
4. The range of the plots seems to be incorrect.
The reason it is more complicated using ggplot2 is that you are using densityplot from the mice package (mice::densityplot.mids to be precise; check out its code), not from lattice itself. This function has all the functionality for plotting mids result classes from mice built in. If you tried the same using lattice::densityplot, you would find it to be at least as much work as using ggplot2.
But without further ado, here is how to do it with ggplot2:
require(reshape2)
# Obtain the imputed data, together with the original data
imp <- complete(impute,"long", include=TRUE)
# Melt into long format
imp <- melt(imp, c(".imp",".id","age"))
# Add a variable for the plot legend
imp$Imputed<-ifelse(imp$".imp"==0,"Observed","Imputed")
# Plot. Be sure to use stat_density instead of geom_density in order
# to prevent what you call "unwanted horizontal and vertical lines"
ggplot(imp, aes(x=value, group=.imp, colour=Imputed)) +
stat_density(geom = "path",position = "identity") +
facet_wrap(~variable, ncol=2, scales="free")
But as you can see the ranges of these plots are smaller than those from densityplot. This behaviour should be controlled by parameter trim of stat_density, but this seems not to work. After fixing the code of stat_density I got the following plot:
Still not exactly the same as the densityplot original, but much closer.
Edit: for a true fix we'll need to wait for the next major version of ggplot2, see github.
You can ask Hadley to add a fortify method for this mids class. E.g.
fortify.mids <- function(x){
imps <- do.call(rbind, lapply(seq_len(x$m), function(i){
data.frame(complete(x, i), Imputation = i, Imputed = "Imputed")
}))
orig <- cbind(x$data, Imputation = NA, Imputed = "Observed")
rbind(imps, orig)
}
ggplot 'fortifies' non-data.frame objects prior to plotting
ggplot(fortify.mids(impute), aes(x = bmi, colour = Imputed,
group = Imputation)) +
geom_density() +
scale_colour_manual(values = c(Imputed = "#000000", Observed = "#D55E00"))
Note that each line ends with a '+'; otherwise R treats the command as complete. This is why the legend did not change, and the line starting with a '+' is what produced the error.
You can melt the result of fortify.mids to plot all variables in one graph
library(reshape)
Molten <- melt(fortify.mids(impute), id.vars = c("Imputation", "Imputed"))
ggplot(Molten, aes(x = value, colour = Imputed, group = Imputation)) +
geom_density() +
scale_colour_manual(values = c(Imputed = "#000000", Observed = "#D55E00")) +
facet_wrap(~variable, scales = "free")
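The two fixes discussed above (a '+' at the end of each line, and colours assigned through a named scale_colour_manual) can be tried without mice on simulated data. This is only a sketch: the data frame sim is made up for illustration, with column names borrowed from fortify.mids above, and it does not reproduce the nhanes imputations.

```r
library(ggplot2)

set.seed(1)
# Simulated stand-in (assumption): one observed set and five imputed sets
sim <- rbind(
  data.frame(bmi = rnorm(25, 27, 4), Imputation = 0, Imputed = "Observed"),
  do.call(rbind, lapply(1:5, function(i) {
    data.frame(bmi = rnorm(25, 27, 4), Imputation = i, Imputed = "Imputed")
  }))
)

# Every line of the ggplot call ends with '+', so R keeps reading;
# the named values tie each colour to a legend label
p <- ggplot(sim, aes(x = bmi, colour = Imputed, group = Imputation)) +
  stat_density(geom = "path", position = "identity") +
  scale_colour_manual(values = c(Imputed = "#000000", Observed = "#D55E00"))
```

Mapping colour to a label column and setting the actual colours in the scale is what makes the legend read Imputed and Observed, instead of showing the hex codes stored in the question's col column.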
