I am an R beginner (first semester - we use this programme for univariate statistics) and currently struggling with plotting the outcome of my glm(). I read quite a few threads and help files on the internet, but I have 2 problems: 1) I don't understand the advice because it is too advanced or 2) I understand the advice but when I replicate the code, it doesn't work.
I think I am close to the solution, but my curve doesn't work how it is supposed to. Can anyone tell me what I am doing wrong?
new.data<-data.frame(x=rnorm(50,0,1), y=c("yes", "no"))
mock_model<-glm(y~x, data=new.data, family=binomial)
x1<-seq(min(new.data$x), max(new.data$x), 0.01)
y1<-predict(mock_model, list(x=x1), type="response")
plot(new.data$x, new.data$y, xlab="numeric var", ylab="binary var")
points(x1, y1)
I am new to coding and this platform, so apologies in advance if the information I have provided is not sufficient.
Any advice would be greatly appreciated.
Here's an example using mtcars and the ggplot2 package. The syntax of ggplot2 works roughly like this: You begin a plot with the ggplot() command, within which you can (but don't have to) define aesthetics (the aes() option), which include selection of axis variables, but can also contain options to change the visuals, like colors, linewidths etc. If you define the axis variables within ggplot(), don't forget to put the data assignment (see example below) outside of aes().
Afterwards, you add layers of geoms to plot specific things, like data points with geom_point(), lines with geom_line() or a lot of other fun things. When you want to use the variables and data assigned in the ggplot() command, just leave the geom empty (apart from any visual aes() options you want to use for that specific geom). However, you can define new data and variables for a geom, for example to use different data sources in the same plot.
data(mtcars)
model_shift <- glm(am ~ mpg, data = mtcars, family = 'binomial')
x <- seq(min(mtcars$mpg), max(mtcars$mpg), .1)
y <- predict(model_shift, list(mpg = x), type = 'response')
plot_data <- data.frame(mpg = x, am = y)
library(ggplot2)
ggplot(aes(x = mpg, y = am), data = plot_data) +
  geom_point()
Or with a line instead of points:
ggplot(aes(x = mpg, y = am), data = plot_data) +
  geom_line()
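The point above about giving a geom its own data can be sketched by drawing the fitted curve from plot_data and overlaying the observed 0/1 values from mtcars. This is only an illustration; the object name p_overlay is made up for this example.

```r
library(ggplot2)

# Fit the same logistic regression as above
model_shift <- glm(am ~ mpg, data = mtcars, family = binomial)

# Grid of mpg values and the predicted probabilities on that grid
plot_data <- data.frame(mpg = seq(min(mtcars$mpg), max(mtcars$mpg), 0.1))
plot_data$am <- predict(model_shift, newdata = plot_data, type = "response")

# geom_line() uses the data and aesthetics from ggplot();
# geom_point() gets its own data (the raw observations) but reuses x = mpg, y = am
p_overlay <- ggplot(plot_data, aes(x = mpg, y = am)) +
  geom_line() +
  geom_point(data = mtcars)
p_overlay
```

Because mtcars also has columns named mpg and am, the point layer can inherit the aesthetics unchanged; with differently named columns you would add an aes() inside geom_point().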
To get a glimpse of the seemingly endless possibilities of ggplot2, have a look at these 'Top 50' ggplot2 visualizations. To learn the package-specific language, see this tutorial or check your university's library for Hadley Wickham's book ggplot2: elegant graphics for data analysis.
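For the base-graphics attempt in the question, a likely cause of the misplaced curve is the coding of y: predict(..., type = "response") returns probabilities between 0 and 1, while a "yes"/"no" column is either plotted on a 1-2 scale (as a factor) or rejected outright (as a character vector in recent R versions). A minimal sketch, assuming "yes" should count as the event; the column name y01 and the set.seed() call are additions for this example:

```r
set.seed(1)  # added so the example is reproducible
new.data <- data.frame(x = rnorm(50, 0, 1), y = c("yes", "no"))

# Recode the response as 0/1 so model and plot share the same scale
new.data$y01 <- as.numeric(new.data$y == "yes")

mock_model <- glm(y01 ~ x, data = new.data, family = binomial)

x1 <- seq(min(new.data$x), max(new.data$x), 0.01)
y1 <- predict(mock_model, newdata = data.frame(x = x1), type = "response")

plot(new.data$x, new.data$y01, xlab = "numeric var", ylab = "binary var")
lines(x1, y1)  # fitted probabilities, drawn as a curve
```

With random x and an alternating response the fitted curve will be nearly flat, which is expected for this mock data.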
I am using quantile regression in R with the qgam package and visualising them using the mgcViz package, but I am struggling to understand how to control the appearance of the plots. The package effectively turns gams (in my case mqgams) into ggplots.
Simple reprex:
library(qgam)
library(mgcViz)
egfit <- mqgam(data = iris,
               Sepal.Length ~ s(Petal.Length),
               qu = c(0.25, 0.5, 0.75))
plot.mgamViz(getViz(egfit))
I am able to control things that can be added, for example the axis labels and theme of the plot, but I'm struggling to affect things that would normally be addressed in the aes() or geom_x() functions.
How would I control the thickness of the line? If this were a normal geom_smooth() or geom_line() I'd simply put size = 1 inside of the geoms, but I cannot see how I'd do so here.
How can I control the linetype of these lines? The "id" is continuous and one cannot supply a linetype to a continuous scale. If this were a normal plot I would convert "id" to a character, but I can't see a way of doing so with the plot.mgamViz function.
How can I supply a new colour scale? It seems as though if I provide it with a new colour scale it invents new ID values to put on the legend that don't correlate to the actual "id" values, e.g.
plot.mgamViz(getViz(egfit)) + scale_colour_viridis_c()
I fully expect this to be relatively simple and that I'm missing something obvious, and I imagine the answers to all three of these subquestions are very similar to one another. Thanks in advance.
You need to extract your ggplot element using this:
p1 <- plot.mgamViz(getViz(egfit))
p <- p1$plots[[1]]$ggObj
Then, id should be as.factor:
p$data$id <- as.factor(p$data$id)
Now you can play with ggplot elements as you prefer:
library(mgcViz)
egfit <- mqgam(data = iris,
               Sepal.Length ~ s(Petal.Length),
               qu = c(0.25, 0.5, 0.75))
p1 <- plot.mgamViz(getViz(egfit))
# Extract the underlying ggplot object (id is converted to a factor below)
p <- p1$plots[[1]]$ggObj
p$data$id <- as.factor(p$data$id)
# Changing ggplot attributes
p <- p +
  geom_line(linetype = 3, size = 1) +
  scale_color_brewer(palette = "Set1") +
  labs(x = "Petal Length", y = "s(Petal Length)", color = "My ID labels:") +
  theme_classic(14) +
  theme(legend.position = "bottom")
p
Here is the generated plot:
Hope it is useful!
I calculated a linear mixed model using the nlme package. I was evaluating a psychological treatment and used treatment condition and measurement point as predictors. I did post-hoc comparisons using the emmeans package. So far so good, everything worked out well and I am looking forward to finishing my thesis. There is only one problem left. I am really, really bad at plotting. I want to plot the emmeans for the four measurement points for each group. The emmip function in emmeans does this, but I am not that happy with the result. I used the following code to generate the result:
emmip(HLM_IPANAT_pos, Gruppe~TP, CIs=TRUE) + theme_bw() + labs(x = "Zeit", y = "IPANAT-PA")
I don't like the way the confidence intervals are presented. I would prefer a line plot with "normal" confidence bars, like the one below, which is taken from Ireland et al. (2017). I tried to do it in Excel, but could not work out how to add separate confidence intervals for each line. So I was wondering whether it is possible to do it using ggplot2. However, I do not know how to use the values I obtained from emmeans in ggplot. As I said, I really have no idea about plotting. Does someone know how to do it?
I think it is possible. Rather than using emmip to create the plot, you could use emmeans to get the values for ggplot2. With ggplot2 and the data, you might be able to better control the format of the plot. Since I do not have your data, I can only suggest a few steps.
First, after fitting the model HLM_IPANAT_pos, get values using emmeans. Second, broom::tidy this object. Third, ggplot the above broom::tidy object.
Using mtcars data as an example:
library(emmeans)
# mtcars data
mtcars$cyl = as.factor(mtcars$cyl)
# Model
mymodel <- lm(mpg ~ cyl * am, data = mtcars)
# using ggplot2
library(tidyverse)
broom::tidy(emmeans(mymodel, ~ am | cyl)) %>%
  mutate(cyl_x = as.numeric(as.character(cyl)) + 0.1*am) %>%
  ggplot(aes(x = cyl_x, y = estimate, color = as.factor(am))) +
  geom_point() +
  geom_line() +
  geom_errorbar(aes(ymin = conf.low, ymax = conf.high), width = 0.1)
Created on 2019-12-29 by the reprex package (v0.3.0)
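A variant of the same idea that skips broom and reads the estimates straight from the emmeans summary may also be useful. This is a sketch under a few assumptions: am is converted to a factor so that emmeans returns one estimate per am level (numeric predictors are otherwise held at their mean), the interval columns are named lower.CL and upper.CL as they are for an lm fit, and the names cars, fit, emm_df and p_emm are made up for this example.

```r
library(emmeans)
library(ggplot2)

# Work on a copy so mtcars itself is left untouched
cars <- mtcars
cars$cyl <- factor(cars$cyl)
cars$am  <- factor(cars$am)

fit <- lm(mpg ~ cyl * am, data = cars)

# Estimated marginal means with confidence limits, as a plain data frame
emm <- emmeans(fit, ~ am | cyl)
emm_df <- as.data.frame(summary(emm, infer = c(TRUE, FALSE)))

# Dodge the two am groups slightly so their error bars do not overlap
pd <- position_dodge(width = 0.2)
p_emm <- ggplot(emm_df, aes(x = cyl, y = emmean, colour = am, group = am)) +
  geom_point(position = pd) +
  geom_line(position = pd) +
  geom_errorbar(aes(ymin = lower.CL, ymax = upper.CL),
                width = 0.1, position = pd)
p_emm
```

For the mixed model in the question, the same pattern should apply with Gruppe and TP in place of am and cyl, although the names of the interval columns can differ depending on the model class.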
I'm currently trying to get my head around the differences between stat_* and geom_* in ggplot2. (Please note this is more of an interest/understanding based question than a specific problem I am trying to solve).
Introduction
My current understanding is that the stat_* functions apply a transformation to your data and that the result is then passed on to the geom_* to be displayed.
Most simple example being the identity transformation which simply passes your data untransformed onto the geom.
ggplot(data = iris) +
  stat_identity(aes(x = Sepal.Length, y = Sepal.Width), geom = "point")
More practical use-cases appear to be when you want to use some transformation and supply the results to a non-default geom, for example if you wanted to plot an error bar of the 1st and 3rd quartile you could do something like:
ggplot(data = iris) +
  stat_boxplot(aes(x = Species, y = Sepal.Length, ymax = ..upper.., ymin = ..lower..), geom = "errorbar")
Question 1
So how / when are these transformations applied to the dataset and how does data pass through them exactly?
As an example, say I wanted to take the stat_boxplot transformation and plot the point of the 3rd quartile how would I do this ?
My intuition would be something like :
ggplot(data = iris) +
  stat_boxplot(aes(x = Species, y = ..upper..), geom = "point")
or
ggplot(data = iris) +
  stat_boxplot(aes(x = Species, y = Sepal.Length), geom = "point")
however both error with
Error: geom_point requires the following missing aesthetics: y
My guess is as part of the stat_boxplot transformation it consumes the y aesthetic and produces a dataset not containing any y variable however this leads onto ....
Question 2
Where can I find out which variables are consumed as part of the stat_* transformation and what variables they output? Maybe I'm looking in the wrong places but the documentation does not seem clear to me at all...
Interesting questions...
As background info, you can read this chapter of R for Data Science, focusing on the grammar of graphics. I'm sure Hadley Wickham's book on ggplot2 is even a better source, but I don't have that one.
The main steps for building a graph with one layer and no facet are:
1. Apply aesthetics mapping on input data (in simple cases, this is a selection and renaming on columns)
2. Apply scale transformation (if any) on each data column
3. Compute stat on each data group (i.e. per Species in this case)
4. Apply aesthetics mapping on stat data, detected with ..<name>.. or stat(name)
5. Apply position adjustment
6. Build graphical objects
7. Apply coordinate transformations
As you guessed, the behaviour at step 3 is similar to dplyr::transmute(): it consumes all aesthetics columns and outputs a data frame having as columns all freshly computed stats and all columns that are constant within the group. The stat output may also have a different number of rows from its input. Thus indeed in your example the y column isn't passed to the geom.
To do this, we'd like to specify different mappings at step 1 (before stat) and at step 4 (before geom). I thought something like this would work:
# This does not work:
ggplot(data = iris) +
  geom_point(
    aes(x = Species, y = stat(upper)),
    stat = stat_boxplot(aes(x = Species, y = Sepal.Length)))
... but it doesn't (stat must be a string or a Stat object, but stat_boxplot actually returns a Layer object, like geom_point does).
NB: stat(upper) is an equivalent, more recent, notation to your ..upper..
I might be wrong but I don't think there is a way of doing this directly within ggplot. What you can do is extract the stat part of the process above and manage it yourself before entering ggplot():
library(tidyverse)
iris %>%
  group_by(Species) %>%
  select(y = Sepal.Length) %>%
  do(StatBoxplot$compute_group(.)) %>%
  ggplot(aes(Species, upper)) + geom_point()
A bit less elegant, I admit...
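The "compute it yourself first" route can also be sketched without touching ggplot2 internals, by calculating the third quartile per species with base R and plotting the result. The second plot is an alternative that keeps the computation inside ggplot via stat_summary(); it assumes ggplot2 3.3.0 or later, where the argument is called fun. The names q3, p_q3 and p_q3_stat are made up for this example.

```r
library(ggplot2)

# Third quartile of Sepal.Length per species, computed outside ggplot
q3 <- aggregate(Sepal.Length ~ Species, data = iris,
                FUN = function(v) quantile(v, 0.75, names = FALSE))
names(q3)[2] <- "upper"

# Plot the pre-computed values as points
p_q3 <- ggplot(q3, aes(x = Species, y = upper)) +
  geom_point()

# Alternative: let stat_summary() apply the same function per x value
p_q3_stat <- ggplot(iris, aes(x = Species, y = Sepal.Length)) +
  stat_summary(fun = function(v) quantile(v, 0.75, names = FALSE),
               geom = "point")
```

Both versions use quantile() with its default settings, so the points should sit at the upper edge of the boxes drawn by geom_boxplot() for the same data.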
For your question 2, it's in the doc: see sections Aesthetics and Computed variables of ?stat_boxplot
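Besides the documentation, one way to see what a stat actually hands to its geom is to build the plot and inspect the layer data with layer_data(). A small sketch; the names p_box and computed are made up for this example, and the exact set of columns may vary between ggplot2 versions:

```r
library(ggplot2)

# A plain boxplot layer built with stat_boxplot() and its default geom
p_box <- ggplot(iris, aes(x = Species, y = Sepal.Length)) +
  stat_boxplot()

# Data frame for layer 1 after the stat has been computed
computed <- layer_data(p_box, 1)

# Column names include the computed variables such as lower, middle and upper
names(computed)
```

The result has one row per Species rather than one row per observation, which illustrates that the stat output can have a different number of rows from its input.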
I would like to get my statistical test results integrated to my plot. Example of my script with dummy variables (dummy data below generated after first post):
library(dplyr)
library(ggplot2)
cases <- rep(1:5,times=10)
var1 <- rep(11:15,times=10)
outcome <- rep(c(1,1,1,2,2),times=10)
maindata <- data.frame(cases,var1,outcome)
df1 <- maindata %>%
  group_by(cases) %>%
  select(cases,var1,outcome) %>%
  summarise(var1 = max(var1, na.rm = TRUE), outcome=mean(outcome, na.rm =TRUE))
wilcox.test(df1$var1[df1$outcome<=1], df1$var1[df1$outcome>1])
ggplot(df1, aes(x = as.factor(outcome), y = as.numeric(var1), fill=outcome)) + geom_boxplot()
With these everything works just fine, but I can't find a way to integrate my wilcox.test results into my plot automatically (of course I can use annotate() and write the results manually, but that's not what I'm after).
My script produces two boxplots with the max value of var1 on the y-axis, grouped by outcome on the x-axis (only two different values for outcome). I would like to add my wilcox.test results to that boxplot; all other relevant data is present. I searched forums and help files but could not find a way (at least with ggplot2).
I'm new to R and trying to learn through ggplot2 and dplyr, which I see as the most intuitive packages for manipulation and visualization. I don't know if they are optimal for the solution I'm after, so feel free to suggest solutions from alternative packages as well...
I think this figure shows what you want. I also added some parts to the code because you're new to ggplot2. Take or leave them, but they're things I do to make publication-quality figures:
wtOut = wilcox.test(df1$var1[df1$outcome<=1], df1$var1[df1$outcome>1])
exampleOut <- ggplot(df1,
                     aes(x = as.factor(outcome), y = as.numeric(var1), fill = outcome)) +
  geom_boxplot() +
  scale_fill_gradient(name = paste0("P-value: ",
                                    signif(wtOut$p.value, 3), "\nOutcome")) +
  ylab("Variable 1") + xlab("Outcome") + theme_bw()
ggsave('exampleOut.jpg', exampleOut, width = 6, height = 4)
If you want to include the p-value as its own legend, it looks like it is some work, but doable.
Or, if you want, just throw signif(wtOut$p.value, 3) into annotate(...). You'll just need to come up with rules for where to place it.
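The annotate() route could look roughly like this. It is a sketch that rebuilds the dummy data with base R so it runs on its own; the placement rule (top of the panel, midway between the two boxes) and the names p_label and p_annot are assumptions made for this example.

```r
library(ggplot2)

# Dummy data from the question, summarised per case with base R
maindata <- data.frame(cases = rep(1:5, times = 10),
                       var1 = rep(11:15, times = 10),
                       outcome = rep(c(1, 1, 1, 2, 2), times = 10))
df1 <- data.frame(
  cases   = sort(unique(maindata$cases)),
  var1    = as.numeric(tapply(maindata$var1, maindata$cases, max)),
  outcome = as.numeric(tapply(maindata$outcome, maindata$cases, mean))
)

wtOut <- wilcox.test(df1$var1[df1$outcome <= 1], df1$var1[df1$outcome > 1])

# Label text is built from the test result, so it updates with the data
p_label <- paste0("Wilcoxon p = ", signif(wtOut$p.value, 3))

# x = 1.5 sits between the two boxes; y = Inf with vjust pins it to the top
p_annot <- ggplot(df1, aes(x = factor(outcome), y = var1)) +
  geom_boxplot() +
  annotate("text", x = 1.5, y = Inf, label = p_label, vjust = 1.5) +
  labs(x = "Outcome", y = "Variable 1") +
  theme_bw()
p_annot
```

Using y = Inf avoids hard-coding a position on the data scale, so the label stays at the top of the panel if the data change.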
My question has to do with facetting. In my example code below, I look at some facetted scatterplots, then try to overlay information (in this case, mean lines) on a per-facet basis.
The tl;dr version is that my attempts fail. Either my added mean lines compute across all data (disrespecting the facet variable), or I try to write a formula and R throws an error, followed by incisive and particularly disparaging comments about my mother.
library(ggplot2)
# Let's pretend we're exploring the relationship between a car's weight and its
# horsepower, using some sample data
p <- ggplot()
p <- p + geom_point(aes(x = wt, y = hp), data = mtcars)
print(p)
# Hmm. A quick check of the data reveals that car weights can differ wildly, by almost
# a thousand pounds.
head(mtcars)
# Does the difference matter? It might, especially if most 8-cylinder cars are heavy,
# and most 4-cylinder cars are light. ColorBrewer to the rescue!
p <- p + aes(color = factor(cyl))
p <- p + scale_color_brewer(palette = "Set1")
print(p)
# At this point, what would be great is if we could more strongly visually separate
# the cars out by their engine blocks.
p <- p + facet_grid(~ cyl)
print(p)
# Ah! Now we can see (given the fixed scales) that the 4-cylinder cars flock to the
# left on weight measures, while the 8-cylinder cars flock right. But you know what
# would be REALLY awesome? If we could visually compare the means of the car groups.
p.with.means <- p + geom_hline(
  aes(yintercept = mean(hp)),
  data = mtcars
)
print(p.with.means)
# Wait, that's not right. That's not right at all. The green (8-cylinder) cars are all above the
# average for their group. Are they somehow made in an auto plant in Lake Wobegon, MN? Obviously,
# I meant to draw mean lines factored by GROUP. Except also obviously, since the code below will
# print an error, I don't know how.
p.with.non.lake.wobegon.means <- p + geom_hline(
  aes(yintercept = mean(hp) ~ cyl),
  data = mtcars
)
print(p.with.non.lake.wobegon.means)
There must be some simple solution I'm missing.
You mean something like this:
library(plyr)
rs <- ddply(mtcars,.(cyl),summarise,mn = mean(hp))
p + geom_hline(data=rs,aes(yintercept=mn))
It might be possible to do this within the ggplot call using stat_*, but I'd have to go back and tinker a bit. But generally if I'm adding summaries to a faceted plot I calculate the summaries separately and then add them with their own geom.
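The same "summarise first, then add a geom" pattern can be sketched with base R's aggregate() if you would rather not load plyr. The plot is rebuilt from scratch here so the example runs on its own; the name p_means is made up for this example.

```r
library(ggplot2)

# One mean horsepower value per cylinder group
rs <- aggregate(hp ~ cyl, data = mtcars, FUN = mean)
names(rs)[2] <- "mn"

# Because rs contains the faceting variable cyl, each facet gets its own line
p_means <- ggplot(mtcars, aes(x = wt, y = hp, colour = factor(cyl))) +
  geom_point() +
  facet_grid(~ cyl) +
  geom_hline(data = rs, aes(yintercept = mn))
p_means
```

The key detail is that the summary data frame keeps the cyl column; without it, ggplot would draw every mean line in every facet.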
EDIT
Just a few expanded notes on your original attempt. Generally it's a good idea to put aes calls in ggplot that will persist throughout the plot, and then specify different data sets or aesthetics in those geom's that differ from the 'base' plot. Then you don't need to keep specifying data = ... in each geom.
Finally, I came up with a kind of clever use of geom_smooth to do something similar to what you're asking:
p <- ggplot(data = mtcars,aes(x = wt, y = hp, colour = factor(cyl))) +
  facet_grid(~cyl) +
  geom_point() +
  geom_smooth(se=FALSE,method="lm",formula=y~1,colour="black")
The horizontal line (i.e. constant regression eqn) will only extend to the limits of the data in each facet, but it skips the separate data summary step.