ggplot: Line Plot with Standard Deviations on X Axis

I'm trying to use ggplot to create a figure where the X axis is +/-1 SDs of the X variable. I'm not sure what this sort of figure is called or how to go about making it. I've googled ggplot line plot with SDs but have not found anything similar. Any suggestions would be greatly appreciated.
UPDATE:
Here is reproducible code that illustrates where I am at now:
library(tidyverse)  # ggplot2 is attached as part of the tidyverse
iris <- iris
iris <- iris %>% filter(Species == "virginica" | Species == "setosa")
ggplot(iris, aes(x=scale(Sepal.Length), y=Sepal.Width, group = Species,
shape=Species, linetype=Species))+
geom_line() +
labs(title="Iris Data Example",x="Sepal Length", y = "Sepal Width")+
theme_bw()
There are two main differences between the figure I originally posted and this one:
A) The original figure only contains +1 and -1 SDs, while my example contains -1, 0 +1 and +2.
B) The original figure had a Y mean for -1 and +1 SD on the X axis, while my example has the datapoints all over the place.

The scale function in R subtracts the mean and divides the result by the standard deviation, so the resulting variable can be interpreted as 'number of standard deviations from the mean'. See also Wikipedia.
In ggplot2, you can wrap a variable you want with scale() on the fly in the aes() function.
library(ggplot2)
ggplot(mpg, aes(scale(displ), cty)) +
geom_point()
Created on 2021-08-05 by the reprex package (v1.0.0)
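To make the 'number of standard deviations from the mean' interpretation concrete, here is a minimal base R sketch. It standardises Sepal.Length from the built-in iris data by hand and compares the result with scale(). The names z_manual and z_scale are illustrative and not part of the answer above.

```r
# Standardise a variable by hand: subtract the mean, divide by the SD
x <- iris$Sepal.Length
z_manual <- (x - mean(x)) / sd(x)

# scale() returns a one-column matrix with attributes, so flatten it
z_scale <- as.numeric(scale(x))

# Both approaches agree up to floating point tolerance
all.equal(z_manual, z_scale)
```

A value of -1 in the scaled variable therefore marks an observation one standard deviation below the mean, which is what the axis labels in the plot refer to.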
EDIT:
It seems I did not read the legend of the first figure carefully: the authors appear to have binned the data by whether an observation lies more than one standard deviation below or above the mean. To bin the data that way we can use the cut function. We can then use the limits of the scale to exclude the (-1, 1] bin and the labels argument to make prettier axis labels.
I've switched around the x and y aesthetics relative to your example, otherwise one of the species didn't have any observations in one of the categories.
library(tidyverse)  # attaches ggplot2 and dplyr
iris <- iris
iris <- iris %>% filter(Species == "virginica" | Species == "setosa")
ggplot(iris,
       aes(x = cut(scale(Sepal.Width), breaks = c(-Inf, -1, 1, Inf)),
           y = Sepal.Length, group = Species,
           shape = Species, linetype = Species)) +
  # draw a line through the mean of y within each bin
  geom_line(stat = "summary", fun = mean) +
  # keep only the two outer bins and relabel them
  scale_x_discrete(
    limits = c("(-Inf,-1]", "(1, Inf]"),
    labels = c("-1 SD", "+1 SD")
  ) +
  labs(title = "Iris Data Example", y = "Sepal Length", x = "Sepal Width") +
  theme_bw()
#> Warning: Removed 73 rows containing non-finite values (stat_summary).
Created on 2021-08-05 by the reprex package (v1.0.0)
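To see the numbers behind the two line segments, the binning and averaging can be reproduced outside ggplot2. This is only a sketch in base R under the assumption that standardising over the pooled two-species data matches what the plot does. The bin labels and the names outer_rows and bin_means are illustrative.

```r
# Keep the same two species as in the plot above
d <- subset(iris, Species %in% c("virginica", "setosa"))

# Standardise over the pooled data, then bin at -1 and +1 SD
# (the bin labels here are illustrative choices)
z <- as.numeric(scale(d$Sepal.Width))
d$bin <- cut(z, breaks = c(-Inf, -1, 1, Inf),
             labels = c("-1 SD", "within 1 SD", "+1 SD"))

# Drop the middle bin and average Sepal.Length per bin and species
outer_rows <- d[d$bin != "within 1 SD", ]
bin_means <- aggregate(Sepal.Length ~ bin + Species,
                       data = outer_rows, FUN = mean)
bin_means
```

Each row of bin_means corresponds to one end point of a line in the figure. The rows dropped from the middle bin are the ones ggplot2 reports as removed in the warning.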

Related

ggplot(data = df1) with added ggMarginal (data = df2)

I aim to create a ggplot with Date along the x axis, and jump height along the y axis. Simplistically, for 1 athlete in a large group of athletes, this will allow the reader to see improvements in jump height over time.
Additionally, I would like to add a ggMarginal(type = "density") to this plot. Here, I aim to plot the distribution of all athlete jump heights. As a result, the reader can interpret the performance of the primary athlete in relationship to the group distribution.
For the sake of a reproducible example, the iris data frame will work.
'''
library(dplyr)
library(ggplot2)
library(ggExtra)
df1 <- iris %>%
filter(Species == "setosa")
df2 <- iris
# I have tried the following, but a variety of errors have occurred:
ggplot(NULL, aes(x=Sepal.Length, y=Sepal.Width))+
geom_point(data=df1, size=2)+
ggMarginal(data = df2, aes(x=Sepal.Length, y=Sepal.Width), type="density", margins = "y", size = 6)
'''
Although this data frame is significantly different than mine, in relation to the Iris data set, I aim to plot x = Sepal.Length, y = Sepal.Width for the Setosa species (df1), and then use ggMarginal to show the distribution of Sepal.Width on the y axis for all the species (df2)
I hope this makes sense!
Thank you for your time and expertise
As far as I can tell from the docs, you can't specify a separate data frame for ggMarginal. Either you specify a plot to which you want to add a marginal plot, or you provide the data directly to ggMarginal.
But one option to achieve your desired result would be to create your density plot as a separate plot and glue it to your main plot via patchwork:
library(ggplot2)
library(patchwork)
df1 <- subset(iris, Species == "setosa")
df2 <- iris
p1 <- ggplot(df1, aes(x = Sepal.Length, y = Sepal.Width)) +
geom_point(size = 2)
p2 <- ggplot(df2, aes(y = Sepal.Width)) +
geom_density() +
theme_void()
p1 + p2 +
plot_layout(widths = c(6, 1))
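One thing to watch with this approach: each panel gets its own y scale by default, so the density may not line up with the scatter plot. A possible refinement, assuming it is acceptable to fix both panels to the range of Sepal.Width in df2, is to apply the same limits to both plots. The name y_lim is illustrative.

```r
library(ggplot2)
library(patchwork)

df1 <- subset(iris, Species == "setosa")
df2 <- iris

# Shared y range taken from the full data set (an assumption of this sketch)
y_lim <- range(df2$Sepal.Width)

p1 <- ggplot(df1, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point(size = 2) +
  coord_cartesian(ylim = y_lim)

p2 <- ggplot(df2, aes(y = Sepal.Width)) +
  geom_density() +
  coord_cartesian(ylim = y_lim) +
  theme_void()

combined <- p1 + p2 + plot_layout(widths = c(6, 1))
combined
```

coord_cartesian() only zooms the view, so the density is still estimated from all of df2.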

How to plot marginal distribution of each attribute?

I am trying to plot the marginal distributions of each attribute c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width") for each of the three "Species" of iris. Essentially, for each "Species" I need 4 marginal distribution plots. I tried to use the ks package but cannot seem to split them up into separate species.
I used the following:
attach(iris)
library(ks)
library(rgl)
library(misc3d)
s <- levels(iris$Species)
fhat <- kde(x = iris[iris$Species == s[1], 1])  # column 1 is Sepal.Length, matching the axis label
plot(fhat, cont=50, xlab="Sepal length", main="Setosa")
Is there a way to put this in a loop to produce the 12 plots required? How do I plot it for 2 dimensions?
Using ggplot you can arrange all densities in one plot. To do so you need to first pivot the data into long format and can then facet by the variables and Species:
library(tidyverse)
iris %>%
pivot_longer(Sepal.Length:Petal.Width) %>%
ggplot() +
geom_density(aes(x = value)) +
facet_wrap(~ name + Species, scales = "free")
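If separate base R plots are preferred, the loop asked about in the question could look roughly like the sketch below. It swaps ks::kde for stats::density, which is my assumption to keep the example free of extra packages. The list name dens and the panel layout are illustrative.

```r
vars <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")
species <- levels(iris$Species)

# One panel per species (rows) and attribute (columns): 3 x 4 = 12 plots
old_par <- par(mfrow = c(length(species), length(vars)), mar = c(4, 4, 2, 1))
dens <- list()
for (sp in species) {
  for (v in vars) {
    key <- paste(sp, v, sep = ": ")
    # Kernel density estimate of one attribute within one species
    dens[[key]] <- density(iris[iris$Species == sp, v])
    plot(dens[[key]], main = key, xlab = v)
  }
}
par(old_par)  # restore the previous graphics settings
```

The same double loop should work with kde() in place of density() if the ks estimates are needed.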

ggRadar highlight top values in radar

Hi everyone, I am making a radar plot and I want to highlight the two highest values in the factors or levels. Highlighting in this case means making the text of the top three values bold.
require(ggplot2)
require(ggiraph)
require(plyr)
require(reshape2)
require(moonBook)
require(sjmisc)
ggRadar(iris,aes(x=c(Sepal.Length,Sepal.Width,Petal.Length,Petal.Width)))
an example can be like this
thank you
Here is a step-by-step example of how to highlight specific categories in a radar plot. I don't really see the point of all these extra dependencies (ggRadar etc.), as it's pretty straightforward to draw a radar plot in ggplot2 directly using polar coordinates.
First, let's generate some sample data. Following the OP's comments and the example based on the iris dataset, we select the maximal value for every variable (from Sepal.Length, Sepal.Width, Petal.Length, Petal.Width); we then store the result in a long tibble for plotting.
library(purrr)
library(dplyr)
library(tidyr)
df <- iris %>% select(-Species) %>% map_df(max) %>% pivot_longer(everything())
df
# # A tibble: 4 x 2
# name value
# <chr> <dbl>
#1 Sepal.Length 7.9
#2 Sepal.Width 4.4
#3 Petal.Length 6.9
#4 Petal.Width 2.5
Next, we make use of a custom coord_radar function (thanks to this post), that is centred around coord_polar and ensures that polygon lines in a polar plot are straight lines rather than curved arcs.
coord_radar <- function (theta = "x", start = - pi / 2, direction = 1) {
theta <- match.arg(theta, c("x", "y"))
r <- if (theta == "x") "y" else "x"
ggproto(
"CordRadar", CoordPolar, theta = theta, r = r, start = start,
direction = sign(direction),
is_linear = function(coord) TRUE)
}
We now create a new column df$face that is "bold" for the top 3 variables (ranked by decreasing value) and "plain" otherwise. We also need to make sure that factor levels of our categories are sorted by row number (otherwise name and face won't necessarily match later).
df <- df %>%
mutate(
rnk = rank(-value),
face = if_else(rnk < 4, "bold", "plain"),
name = factor(name, levels = unique(name)))
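As a quick check of the ranking step, here is a base R sketch using the four maxima from the tibble printed earlier. It uses ifelse() instead of dplyr's if_else() only to avoid loading a package.

```r
# Maximum values from the tibble printed earlier
value <- c(Sepal.Length = 7.9, Sepal.Width = 4.4,
           Petal.Length = 6.9, Petal.Width = 2.5)

# Negating the values turns rank() into a descending rank: 1 = largest
rnk <- rank(-value)

# The three largest values get a bold label, the rest stay plain
face <- ifelse(rnk < 4, "bold", "plain")
face
```

Here only Petal.Width ends up "plain". With tied values, rank() averages the tied positions by default, so the cutoff may select more or fewer than three labels.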
We can now draw the plot
library(ggplot2)
ggplot(df, aes(name, value, group = 1)) +
geom_polygon(fill = "red", colour = "red", alpha = 0.4) +
geom_point(colour = "red") +
coord_radar() +
ylim(0, 10) +
theme(axis.text.x = element_text(face = df$face))
Note that this gives a warning, which I choose to ignore here, as we explicitly make use of the vectorised element_text option.
Warning message:
Vectorized input to element_text() is not officially supported.
Results may be unexpected or may change in future versions of ggplot2.
My suggestion would be to identify the highest values you wish to highlight and put them in a data frame. Then use geom_richtext() from the ggtext package to highlight them.

violin_plot() with continuous axis for grouping variable?

The grouping variable for creating a geom_violin() plot in ggplot2 is expected to be discrete for obvious reasons. However my discrete values are numbers, and I would like to show them on a continuous scale so that I can overlay a continuous function of those numbers on top of the violins. Toy example:
library(tidyverse)
df <- tibble(x = sample(c(1,2,5), size = 1000, replace = T),
y = rnorm(1000, mean = x))
ggplot(df) + geom_violin(aes(x=factor(x), y=y))
This works as you'd imagine: violins with their x axis values (equally spaced) labelled 1, 2, and 5, with their means at y=1,2,5 respectively. I want to overlay a continuous function such as y=x, passing through the means. Is that possible? Adding + scale_x_continuous() predictably gives Error: Discrete value supplied to continuous scale. A solution would presumably spread the violins horizontally by the numeric x values, i.e. three times the spacing between 2 and 5 as between 1 and 2, but that is not the only thing I'm trying to achieve - overlaying a continuous function is the key issue.
If this isn't possible, alternative visualisation suggestions are welcome. I know I could replace violins with a simple scatter plot to give a rough sense of density as a function of y for a given x.
The functionality to plot violin plots on a continuous scale is built directly into ggplot2.
The key is to keep the original continuous variable (instead of transforming it into a factor) and to specify how to group it within the aesthetic mapping of geom_violin(). The width of the groups can be changed through the width argument of cut_width(), depending on the data at hand.
library(tidyverse)
df <- tibble(x = sample(c(1,2,5), size = 1000, replace = T),
y = rnorm(1000, mean = x))
ggplot(df, aes(x=x, y=y)) +
geom_violin(aes(group = cut_width(x, 1)), scale = "width") +
geom_smooth(method = 'lm')
By using this approach, all geoms for continuous data and their varying functionalities can be combined with the violin plots, e.g. we could easily replace the line with a loess curve and add a scatter plot of the points.
ggplot(df, aes(x=x, y=y)) +
geom_violin(aes(group = cut_width(x, 1)), scale = "width") +
geom_smooth(method = 'loess') +
geom_point()
More examples can be found in the ggplot helpfile for violin plots.
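The question asked specifically about overlaying a known function such as y = x rather than a fitted curve. A minimal sketch of that, assuming stat_function() is an acceptable way to draw the curve and grouping directly by x instead of cut_width(), could look like this. The seed and the colour are arbitrary choices.

```r
library(ggplot2)

set.seed(1)
x <- sample(c(1, 2, 5), size = 1000, replace = TRUE)
df <- data.frame(x = x, y = rnorm(1000, mean = x))

p <- ggplot(df, aes(x = x, y = y)) +
  # x stays numeric, so the violins sit at 1, 2 and 5 on a continuous axis
  geom_violin(aes(group = x)) +
  # overlay the known function y = x across the range of the x axis
  stat_function(fun = function(x) x, colour = "red")
p
```

Any other function of x can be passed to fun in the same way.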
Try this. As you already guessed, spreading the violins by numeric values is the key to the solution. To this end I expand the df to include all x values in the interval min(x) to max(x) and use scale_x_discrete(drop = FALSE) so that all values are displayed.
Note: Thanks @ChrisW for the more general example of my approach.
library(tidyverse)
set.seed(42)
df <- tibble(x = sample(c(1,2,5), size = 1000, replace = T), y = rnorm(1000, mean = x^2))
# y = x^2
# add missing x values
x.range <- seq(from=min(df$x), to=max(df$x))
df <- df %>% right_join(tibble(x = x.range))
#> Joining, by = "x"
# Whatever the desired continuous function is:
df.fit <- tibble(x = x.range, y=x^2) %>%
mutate(x = factor(x))
ggplot() +
geom_violin(data=df, aes(x = factor(x, levels = 1:5), y=y)) +
geom_line(data=df.fit, aes(x, y, group=1), color = "red") +
scale_x_discrete(drop = FALSE)
#> Warning: Removed 2 rows containing non-finite values (stat_ydensity).
Created on 2020-06-11 by the reprex package (v0.3.0)

How to set aside certain numeric values of x with ggplot?

I have a continuous scale that includes some values coding different categories of missing data (for example 998 and 999), and I want to make a plot that excludes these numeric missing values.
Because these codes sit together, I can use xlim each time, but since xlim fixes the domain of the plot, I have to change the limits for each case.
So I am asking for a solution. I can think of two possibilities.
First, is it possible to set non-binding limits on the x values? I mean, if I give 990 as the maximum limit but the largest value present is 100, the plot should show an x range up to approximately 100, not 990 as xlim does.
Second, is there an opposite of xlim, so that the range given by the limits (or a discrete set of values) is excluded from the x axis?
Thanks in advance.
I think the simplest way is to exclude these values in the plot, either before or during the ggplot call.
MWE
library(tidyverse)
# Create data containing out-of-range sentinel values
mtcars2 <- mtcars
mtcars2[5:15, 'mpg'] <- 998
# Full plot
mtcars2 %>% ggplot() +
geom_point(aes(x = mpg, y = disp))
Filtering before plot
mtcars2 %>%
filter(mpg < 250) %>%
ggplot() +
geom_point(aes(x = mpg, y = disp))
Filtering during plot
mtcars2 %>%
ggplot() +
geom_point(aes(x = mpg, y = disp), data = . %>% filter(mpg < 250))
I would filter those missing values from the original dataset:
library(dplyr)
df <- data.frame(cat = rep(LETTERS[1:4], 3),
values = sample(10, 12, replace = TRUE)
)
# Add missing values
df$values[c(1,5,10)] <- 999
df$values[c(2,7)] <- 998
invalid_values <- c(998, 999)
library(ggplot2)
df %>%
filter(!values %in% invalid_values) %>%
ggplot() +
geom_point(aes(cat, values))
Alternatively, if that's not possible for some reason, you can define a scale transformation:
df %>%
ggplot() +
geom_point(aes(cat, values)) +
scale_y_continuous(trans = scales::trans_new('remove_invalid',
transform = function(d) {d <- if_else(d %in% invalid_values, NA_real_, d)},
inverse = function(d) {if_else(is.na(d), 999, d)}
)
)
#> Warning: Transformation introduced infinite values in continuous y-axis
#> Warning: Removed 5 rows containing missing values (geom_point).
Created on 2018-05-09 by the reprex package (v0.2.0).
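A third option, sketched here as an assumption rather than as part of either answer, is to recode the sentinel codes to NA once, before any plotting. The data below are hypothetical and only mimic the 998/999 coding from the question.

```r
# Hypothetical data with 998/999 as missing-value codes
df <- data.frame(cat = rep(LETTERS[1:4], 3),
                 values = c(3, 998, 7, 1, 999, 5, 998, 2, 9, 999, 4, 6))
invalid_values <- c(998, 999)

# Recode the sentinel codes to NA once, up front
clean <- df
clean$values[clean$values %in% invalid_values] <- NA

# The remaining range is what a default axis would be based on
range(clean$values, na.rm = TRUE)
```

ggplot2 drops rows with missing values in geom_point() and warns about it, so the axis adapts to the valid data without setting xlim for each case.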
