I am trying to plot the marginal distributions of each attribute c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width") for each of the three "Species" of iris. Essentially, for each "Species" I need 4 marginal distribution plots. I tried to use the ks package but cannot seem to split them up into separate species.
I used the following:
library(ks)
library(rgl)
library(misc3d)
s <- levels(iris$Species)
# Column 1 is Sepal.Length, which matches the axis label used below
fhat <- kde(x = iris[iris$Species == s[1], 1])
plot(fhat, cont = 50, xlab = "Sepal length", main = "Setosa")
Is there a way to put this in a loop to produce the 12 plots required? How do I plot it for 2 dimensions?
Using ggplot2, you can arrange all the densities in one plot. To do so, first pivot the data into long format, then facet by variable and Species:
library(tidyverse)
iris %>%
  pivot_longer(Sepal.Length:Petal.Width) %>%
  ggplot() +
  geom_density(aes(x = value)) +
  facet_wrap(~ name + Species, scales = "free")
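If separate plots in a loop are preferred, as the question asks, the idea can be sketched in base R. Note that this sketch swaps in stats::density() for ks::kde(); the names attrs, species and dens are only illustrative.

```r
# One density plot per species/attribute pair: 3 species x 4 attributes = 12 plots
attrs <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")
species <- levels(iris$Species)

# Lay the plots out on a 3 x 4 grid and remember the old graphics settings
op <- par(mfrow = c(length(species), length(attrs)), mar = c(4, 4, 2, 1))
dens <- list()
for (sp in species) {
  for (a in attrs) {
    # Kernel density estimate for one attribute within one species
    d <- density(iris[iris$Species == sp, a])
    dens[[paste(sp, a, sep = ".")]] <- d
    plot(d, main = sp, xlab = a)
  }
}
par(op)  # restore the previous graphics settings
```

Keeping the estimates in a list makes it possible to reuse them later without recomputing.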
I aim to create a ggplot with Date along the x axis, and jump height along the y axis. Simplistically, for 1 athlete in a large group of athletes, this will allow the reader to see improvements in jump height over time.
Additionally, I would like to add a ggMarginal(type = "density") to this plot. Here, I aim to plot the distribution of all athlete jump heights. As a result, the reader can interpret the performance of the primary athlete in relationship to the group distribution.
For the sake of a reproducible example, the iris data frame will work.
'''
library(dplyr)
library(ggplot2)
library(ggExtra)
df1 <- iris %>%
  filter(Species == "setosa")
df2 <- iris
# I have tried the following, but a variety of errors have occurred:
ggplot(NULL, aes(x=Sepal.Length, y=Sepal.Width))+
geom_point(data=df1, size=2)+
ggMarginal(data = df2, aes(x=Sepal.Length, y=Sepal.Width), type="density", margins = "y", size = 6)
'''
Although this data frame is significantly different from mine, in terms of the iris data set I aim to plot x = Sepal.Length, y = Sepal.Width for the setosa species (df1), and then use ggMarginal to show the distribution of Sepal.Width on the y axis for all species (df2).
I hope this makes sense!
Thank you for your time and expertise
As far as I can tell from the docs, you can't specify a separate data frame for ggMarginal. Either you pass it a plot to which you want to add a marginal plot, or you provide the data directly to ggMarginal.
But one option to achieve your desired result would be to create your density plot as a separate plot and glue it to your main plot via patchwork:
library(ggplot2)
library(patchwork)
df1 <- subset(iris, Species == "setosa")
df2 <- iris
p1 <- ggplot(df1, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point(size = 2)
p2 <- ggplot(df2, aes(y = Sepal.Width)) +
  geom_density() +
  theme_void()
p1 + p2 +
  plot_layout(widths = c(6, 1))
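One thing to watch with this approach: the two panels are built from different data, so their y axes will not necessarily cover the same range. A possible way to line them up, sketched here under the assumption that fixing both scales to the range of the full data is acceptable (y_limits and combined are illustrative names):

```r
library(ggplot2)
library(patchwork)

df1 <- subset(iris, Species == "setosa")
df2 <- iris

# Shared y range taken from the full data set (an assumption of this sketch)
y_limits <- range(df2$Sepal.Width)

p1 <- ggplot(df1, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point(size = 2) +
  scale_y_continuous(limits = y_limits)

p2 <- ggplot(df2, aes(y = Sepal.Width)) +
  geom_density() +
  scale_y_continuous(limits = y_limits) +
  theme_void()

# Same layout as before, now with both panels on the same y scale
combined <- p1 + p2 + plot_layout(widths = c(6, 1))
```

Printing combined draws the figure; with matching limits, a point in the scatter plot sits at the same height as its position in the density panel.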
I'm trying to use ggplot to create a figure where the X axis is +/-1 SDs of the X variable. I'm not sure what this sort of figure is called or how to go about making it. I've googled ggplot line plot with SDs but have not found anything similar. Any suggestions would be greatly appreciated.
UPDATE:
Here is reproducible code that illustrates where I am at now:
library(tidyverse)  # attaches ggplot2 as well
iris <- iris
iris <- iris %>% filter(Species == "virginica" | Species == "setosa")
ggplot(iris, aes(x=scale(Sepal.Length), y=Sepal.Width, group = Species,
shape=Species, linetype=Species))+
geom_line() +
labs(title="Iris Data Example",x="Sepal Length", y = "Sepal Width")+
theme_bw()
There are two main differences between the figure I originally posted and this one:
A) The original figure only contains +1 and -1 SDs, while my example contains -1, 0, +1, and +2.
B) The original figure had a Y mean for -1 and +1 SD on the X axis, while my example has the data points all over the place.
The scale function in R subtracts the mean and divides the result by the standard deviation, so that the resulting variable can be interpreted as 'number of standard deviations from the mean'. See also Wikipedia.
In ggplot2, you can wrap a variable you want with scale() on the fly in the aes() function.
library(ggplot2)
ggplot(mpg, aes(scale(displ), cty)) +
  geom_point()
Created on 2021-08-05 by the reprex package (v1.0.0)
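To make the description of scale() concrete, here is a small base R check that it matches the manual computation. The variable names are illustrative.

```r
x <- iris$Sepal.Length

# Manual standardisation: subtract the mean, divide by the standard deviation
z_manual <- (x - mean(x)) / sd(x)

# scale() returns a one-column matrix, so convert it to a plain vector
z_scale <- as.numeric(scale(x))

# The two agree up to floating point error
same <- isTRUE(all.equal(z_manual, z_scale))
same
```

A value of 1 in z_scale therefore means "one standard deviation above the mean of Sepal.Length".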
EDIT:
It seems I didn't read the legend of the first figure carefully: the authors appear to have binned the data by whether each observation lies more than one standard deviation above or below the mean. To bin the data that way we can use the cut function. We can then use the limits of the scale to exclude the (-1, 1] bin and the labels argument to make prettier axis labels.
I've switched around the x and y aesthetics relative to your example, otherwise one of the species didn't have any observations in one of the categories.
library(tidyverse)  # attaches ggplot2 as well
iris <- iris
iris <- iris %>% filter(Species == "virginica" | Species == "setosa")
ggplot(iris,
       aes(x = cut(scale(Sepal.Width), breaks = c(-Inf, -1, 1, Inf)),
           y = Sepal.Length, group = Species,
           shape = Species, linetype = Species)) +
  geom_line(stat = "summary", fun = mean) +
  scale_x_discrete(
    limits = c("(-Inf,-1]", "(1, Inf]"),
    labels = c("-1 SD", "+1 SD")
  ) +
  labs(title = "Iris Data Example", y = "Sepal Length", x = "Sepal Width") +
  theme_bw()
#> Warning: Removed 73 rows containing non-finite values (stat_summary).
Created on 2021-08-05 by the reprex package (v1.0.0)
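To see what the binning does before it goes into the plot, the same cut() call can be run on its own. This sketch uses the full iris data rather than the two-species subset above, so the counts will differ from the plot; bins and bin_means are illustrative names.

```r
# Standardise Sepal.Width, then bin by distance from the mean in SD units
z <- as.numeric(scale(iris$Sepal.Width))
bins <- cut(z, breaks = c(-Inf, -1, 1, Inf))

# The level names are what scale_x_discrete(limits = ...) has to match exactly
levels(bins)

# Number of observations per bin
table(bins)

# Mean Sepal.Length within each bin: the quantity geom_line(stat = "summary") draws
bin_means <- tapply(iris$Sepal.Length, bins, mean)
bin_means
```

Checking levels(bins) first is worthwhile, because the limits passed to the scale must match those strings character for character, including any spaces.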
I have several data-sets which are simple transformations of one another, e.g.
iris0 <- iris ; iris1 <- iris; iris2 <- iris
iris1[,1:4] <- sqrt(iris0[,1:4])
iris2[,1:4] <- log(iris0[,1:4])
I want to visualise how the densities of distributions of each attribute are affected by transformations, using density plots in ggplot2.
I could use code of the following kind:
ggplot() + geom_density(aes(x=Attr), fill="red", data=vec_from_dataset1, alpha=.5) + geom_density(aes(x=Attr), fill="blue", data=vec_from_dataset2, alpha=.5)
or, for example, bind the attributes together and then treat them as one dataset. What is the cleanest and most efficient way (probably using Map) to generate a list of density plots in which iris0 is compared to each other dataset (iris1 and iris2) across each numerical attribute, i.e. columns 1-4? (So in this case, there would be 4*2 = 8 density plots in total.)
(I should clarify--no package except base R+ggplot2 please, dplyr if absolutely necessary)
Edit:
Based on the top answer here: Creating density plots from two different data-frames using ggplot2, I had a go with the following:
combs = expand.grid(Attributes=names(X),Datasets=c("iris1","iris2"))
plots <-
Map(function(.x, .y, ds2) {
ggplot(data=iris0, aes(x=.x)) +
geom_density(fill="red") +
geom_density(data=get(ds2), fill="purple") +
xlab(.y) + ggtitle(label=paste0("Density plot for the ",.y))
}, X[names(X)], names(X), as.character(combs[[2]]))
But the output is just the density from the first dataset for each attribute (iris0), filled in purple. Can anyone help?
Here's one approach leveraging rbindlist() from package data.table that gives you a list of ggplot objects you can print or do whatever with downstream.
library(data.table)
library(ggplot2)
iris0 <- iris ; iris1 <- iris; iris2 <- iris
iris1[,1:4] <- sqrt(iris0[,1:4])
iris2[,1:4] <- log(iris0[,1:4])
dt <- rbindlist(list(iris0 = iris0, iris1 = iris1, iris2 = iris2), idcol = TRUE)
plot_list <- expand.grid(dat = c("iris1", "iris2"),
                         var = c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"),
                         stringsAsFactors = FALSE)
zz <- lapply(1:nrow(plot_list), function(i) {
  plot_dat <- dt[.id %in% c("iris0", plot_list[i, "dat"]), c(".id", plot_list[i, "var"]), with = FALSE]
  plot_names <- names(plot_dat)
  ggplot(plot_dat, aes_string(x = plot_names[[2]], fill = plot_names[[1]])) +
    geom_density(alpha = .5) +
    scale_fill_manual("", values = c("red", "blue")) +
    theme_bw() +
    theme(legend.position = c(.8, .8))
})
zz[[3]]
Created on 2020-05-14 by the reprex package (v0.3.0)
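Since the question asked for base R plus ggplot2 only, here is a rough sketch of the same idea with Map and no data.table. It is one possible arrangement, not the only one; transformed, combos and plots are illustrative names.

```r
library(ggplot2)

iris0 <- iris
# The transformed versions of the four numeric columns, keyed by dataset name
transformed <- list(iris1 = sqrt(iris0[, 1:4]), iris2 = log(iris0[, 1:4]))

# Every attribute paired with every transformed dataset: 4 * 2 = 8 rows
combos <- expand.grid(var = names(iris0)[1:4],
                      dat = names(transformed),
                      stringsAsFactors = FALSE)

plots <- Map(function(var, dat) {
  # Stack the original and transformed values into one long data frame
  long <- data.frame(
    value = c(iris0[[var]], transformed[[dat]][[var]]),
    dataset = rep(c("iris0", dat), each = nrow(iris0))
  )
  ggplot(long, aes(x = value, fill = dataset)) +
    geom_density(alpha = 0.5) +
    labs(x = var, title = paste("Density of", var, "in iris0 vs", dat))
}, combos$var, combos$dat)

length(plots)
```

Each element of plots is an ordinary ggplot object, so plots[[3]] prints the third comparison.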
I am trying to display grouped boxplot and combined boxplot into one plot. Take the iris data for instance:
data(iris)
p1 <- ggplot(iris, aes(x=Species, y=Sepal.Length)) +
geom_boxplot()
p1
I am trying to compare the overall distribution with the distributions within each category. Is there a way to display a boxplot of all samples to the left of these three grouped boxplots?
Thanks in advance.
You can rbind a new version of iris, where Species equals "All" for all rows, to iris before piping to ggplot:
library(dplyr)
library(ggplot2)
p1 <- iris %>%
  rbind(iris %>% mutate(Species = 'All')) %>%
  ggplot(aes(x = Species, y = Sepal.Length)) +
  geom_boxplot()
Yes, you can just create a column for all species as follows:
iris <- iris %>% mutate(all = "All Species")
p1 <- ggplot(iris) +
  geom_boxplot(aes(x = Species, y = Sepal.Length)) +
  geom_boxplot(aes(x = all, y = Sepal.Length))
p1
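The question asks for the combined box on the left. One way to control that, sketched here with base R data preparation, is to set the factor levels explicitly so that "All" comes first (combined and p_all_first are illustrative names):

```r
library(ggplot2)

# A copy of iris in which every row is labelled "All"
all_rows <- iris
all_rows$Species <- "All"

# Stack it on top of the original data, with Species as plain text for now
combined <- rbind(all_rows, transform(iris, Species = as.character(Species)))

# Put "All" first so it is drawn leftmost on the discrete x axis
combined$Species <- factor(combined$Species,
                           levels = c("All", levels(iris$Species)))

p_all_first <- ggplot(combined, aes(x = Species, y = Sepal.Length)) +
  geom_boxplot()
```

ggplot2 orders a discrete axis by factor level, which is why setting the levels is enough to move the box.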
I have 4 clusters that I would like to visualize with ggplot.
I tried to plot it with ggplot but I didn't know how to make it look like the figure below. My result was just a scatterplot in which the points were not grouped by cluster and the centroids were not shown.
top50combos_freq: has two columns [freq, freq1]
top50combos_freq.ckmeans: the result of running kmeans on it with 4 clusters.
plot(top50combos_freq[top50combos_freq.ckmeans1$cluster == 1, ],
     col = "red",
     xlim = c(min(top50combos_freq[, 1]), max(top50combos_freq[, 1])),
     ylim = c(min(top50combos_freq[, 2]), max(top50combos_freq[, 2]))
)
points(top50combos_freq[top50combos_freq.ckmeans1$cluster == 2, ],
       col = "blue")
points(top50combos_freq[top50combos_freq.ckmeans1$cluster == 3, ],
       col = "seagreen")
points(top50combos_freq.ckmeans1$centers, pch = 2, col = "green")
Any help making this plot with ggplot will be appreciated. Thanks.
One way to do that would be to create 2 data frames:
one for actual data points, with a factor variable specifying the cluster,
the other one only with centroids (number of rows same as the number of clusters).
Then plot the first data frame as usual and add an additional geom in which you specify the second data frame.
Example using iris data:
library(ggplot2)
# Data frame with actual data points
plotDf <- iris
# Data frame with centroids, one entry per centroid
centroidsDf <- data.frame(
  Sepal.Length = tapply(iris$Sepal.Length, iris$Species, mean),
  Sepal.Width = tapply(iris$Sepal.Width, iris$Species, mean)
)
# First plot data, colouring by cluster (in this case Species variable)
ggplot(
  data = plotDf,
  aes(x = Sepal.Length, y = Sepal.Width, col = Species)
) +
  geom_point() +
  # Then add centroids
  geom_point(
    data = centroidsDf,                      # separate data.frame
    aes(x = Sepal.Length, y = Sepal.Width),
    col = "green",                           # notice "col" and "shape" are
    shape = 2)                               # outside aes()
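The example above uses Species as a stand-in for cluster membership. With an actual kmeans result, the same two-data-frame pattern might look like the sketch below. It runs kmeans on two iris columns with 4 clusters purely for illustration; the seed, nstart value and object names are assumptions, not taken from the question.

```r
library(ggplot2)

set.seed(42)  # kmeans starts from random centres, so fix the seed
features <- iris[, c("Sepal.Length", "Sepal.Width")]
km <- kmeans(features, centers = 4, nstart = 25)

# Data frame 1: the points, with cluster membership as a factor
points_df <- cbind(features, cluster = factor(km$cluster))

# Data frame 2: one row per centroid, same column names as the features
centers_df <- as.data.frame(km$centers)

p_clusters <- ggplot(points_df,
                     aes(x = Sepal.Length, y = Sepal.Width, colour = cluster)) +
  geom_point() +
  # Centroids come from their own data frame and do not inherit the colour mapping
  geom_point(data = centers_df,
             aes(x = Sepal.Length, y = Sepal.Width),
             inherit.aes = FALSE,
             colour = "black", shape = 17, size = 4)
```

For the data in the question, features would be replaced by top50combos_freq and the column names in aes() adjusted to freq and freq1.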