I think ggplot2 is inconsistent in how it treats -Inf on a log scale. Below are two plots that display similar information. In the first, I use a box plot to show the 25th, 50th, and 75th percentiles of the example dataset on a log scale. As you can see, it removes the zeros (-Inf on the log scale) and builds the box plot from the remaining data points.
In the second example, I precalculate the 25th, 50th, and 75th percentiles and then use geom_point to make a similar plot on a log scale. In this case, the zero value is simply plotted at the bottom of the axis instead of being removed. I think this behaviour is better. Is there a way to draw the box plot without removing the -Inf values?
Thanks a lot.
library(tidyverse)
# create dataset
Test_Data <- data.frame("Values" = seq(0,10000,1))
Test_Data$Values[Test_Data$Values%%3==0] <- 0
Test_Data$Groups[Test_Data$Values%%2==0] <- 'A'
Test_Data$Groups[Test_Data$Values%%2!=0] <- 'B'
Test_Data$Values <- Test_Data$Values/10000
# plot boxplot
ggplot(data = Test_Data, aes(x= Groups, y = Values)) +
geom_boxplot()+
scale_y_continuous(trans='log10', limits = c(1e-5, 1e1)) +
annotation_logticks(sides = "l")
# Create dataset with quantile measures
Test_Data_Processed <- Test_Data %>%
select_all() %>%
group_by(Groups) %>%
summarise(Num = n(),
Median = median(Values),
Percnt_25 = quantile(Values,.25),
Percnt_75 = quantile(Values, .75)) %>%
gather(Measure, Value, Median:Percnt_75)
# plot with geom_point
ggplot(data = Test_Data_Processed, aes(x= Groups, y = Value, shape = Measure, color = Groups)) +
geom_point(size = 4) +
scale_y_continuous(trans='log10', limits = c(1e-5, 1e1)) +
annotation_logticks(sides = "l")
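The difference between the two plots can be reproduced outside ggplot2. The sketch below is only an illustration in base R with a small made-up vector (`values` is hypothetical, not the dataset from the question). Zeros become `-Inf` under `log10()`, and quartiles computed before and after dropping those non-finite values differ, which would explain why the box plot and the precalculated points disagree.

```r
# Hypothetical toy data: two zeros plus four positive values
values <- c(0, 0, 0.001, 0.01, 0.1, 1)

# On a log10 scale the zeros become -Inf (non-finite)
log_values <- log10(values)
n_nonfinite <- sum(!is.finite(log_values))

# Quartiles of all values (what the precalculated geom_point plot uses)
q_all <- quantile(values, c(0.25, 0.5, 0.75))

# Quartiles after dropping the zeros (what is left once -Inf is removed)
q_pos <- quantile(values[values > 0], c(0.25, 0.5, 0.75))

print(n_nonfinite)
print(rbind(all_values = q_all, zeros_removed = q_pos))
```

Every quartile shifts upward once the zeros are dropped, so the two plots summarise different subsets of the data.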
I have a dataset at the municipality level. I would like to draw a histogram of a given variable and, at the same time, fill the bars with another continuous variable (using a color gradient). This is because I believe the municipalities with low values of the variable I am plotting have, on average, very different population sizes from the municipalities at the upper end of the distribution.
Using the mtcars data, say I would like to plot the distribution of mpg and fill the bars with a continuous color that represents the mean of wt for each histogram bar. I typed the code below, but I don't know how to make the fill option take the average of wt. I would like a legend with a color gradient to show whether the mean value of wt for each histogram bar is low, medium, or high in relative terms.
mtcars %>%
ggplot(aes(x=mpg, fill=wt)) +
geom_histogram()
If you want a genuine histogram, you need to summarize your data first and plot it with geom_col rather than geom_histogram. The base R function hist can generate the breaks and midpoints for you:
library(ggplot2)
library(dplyr)
mtcars %>%
mutate(mpg = cut(x = mpg,
breaks = hist(mpg, breaks = 0:4 * 10, plot = FALSE)$breaks,
labels = hist(mpg, breaks = 0:4 * 10, plot = FALSE)$mids)) %>%
group_by(mpg) %>%
summarize(n = n(), wt = mean(wt)) %>%
ggplot(aes(x = as.numeric(as.character(mpg)), y = n, fill = wt)) +
scale_x_continuous(limits = c(0, 40), name = "mpg") +
geom_col(width = 10) +
theme_bw()
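To see what the summarising step produces before it reaches geom_col, here is a minimal base R sketch of the same binning. It uses the same breaks as the answer; `bin` and `summary_tbl` are illustrative names, not part of the answer.

```r
# Bin edges and midpoints, using the same breaks as the answer
h <- hist(mtcars$mpg, breaks = 0:4 * 10, plot = FALSE)

# Assign each car to a bin labelled by its midpoint
bin <- cut(mtcars$mpg, breaks = h$breaks, labels = h$mids)

# Count and mean weight per bin
summary_tbl <- data.frame(
  mid = as.numeric(names(table(bin))),
  n   = as.vector(table(bin)),
  wt  = as.vector(tapply(mtcars$wt, bin, mean))
)

# Drop bins that contain no cars (their mean weight is NA)
summary_tbl <- summary_tbl[summary_tbl$n > 0, ]
print(summary_tbl)
```

Each remaining row corresponds to one bar: `mid` is its x position, `n` its height, and `wt` the value mapped to fill.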
It is not exactly a histogram, but it was the closest I could come up with for your problem:
library(tidyverse)
mtcars %>%
#Create breaks for mpg, where this sequence is just an example
mutate(mpg_cut = cut(mpg,seq(10,35,5))) %>%
#Count and mean of wt by mpg_cut
group_by(mpg_cut) %>%
summarise(
n = n(),
wt = mean(wt)
) %>%
ggplot(aes(x=mpg_cut, fill=wt)) +
#Bar plot
geom_col(aes(y = n), width = 1)
I am plotting some density curves, and I want to add a point at the mean of each group. However, I want to plot these points along the top of the density curve, not at 0. Is there a way to get the value of the density at the mean within each group? Code follows:
# make df
df<- data.frame(group=c("a","b",'c'),
value=rnorm(
3000,
mean=c(1,2,3),
sd=c(1,1.5,1)
))
library(tidyverse)
library(ggridges)
library(ggdist)
Way 1: density ridges from the ggridges package
df %>%
# calculate mean density per group to use later
group_by(group)%>%
mutate(mean_value=mean(value)) %>%
ggplot()+
aes(x=value,y=group)+
geom_density_ridges()+
# could do with stat summary - blue points
stat_summary(
orientation = "y",
fun = mean,
geom = "point",
color="blue"
)+
# or could do with geom_point using precalculated value (red points)
# nudged so we can see both.
geom_point(aes(x=mean_value,y=group),
color="red",
position = position_nudge(x=.1)
)
Way 2: stat_halfeye from the ggdist package
df %>%
group_by(group)%>%
mutate(mean_value=mean(value)) %>%
# mutate(mean_density = density(mean_value,value))
ggplot()+
aes(x=value,y=group)+
stat_halfeye()+
# could do with stat summary
stat_summary(
orientation = "y",
fun = mean,
geom = "point",
color="blue",
alpha=.8
)+
# or could do with geom_point using precalculated value
# nudged so we can see both.
geom_point(aes(x=mean_value,y=group),
color="red",
position = position_nudge(x=.1)
)
Desired output: the blue or red points should sit at the top of the density curve. So I will need a y aesthetic that is something like "group + density value."
I would rather use way 2 (ggdist) than geom_density_ridges.
Thanks
I'm not sure if there's a way to calculate the height of the density curve at the mean value within the ggplot geom/stat functions, so I've created a couple of helper functions to do that.
dens_at_mean calculates the height of the density curve at the mean of the data. get_mean_coords runs dens_at_mean by group and then scales the height values to match the y-values generated by stat_halfeye and returns a data frame that can be passed to geom_point.
# Reproducible data
set.seed(394)
df<- data.frame(group=c("a","b",'c'),
value=rnorm(
3000,
mean=c(1,2,3),
sd=c(1,1.5,1)
))
# Function to get height of density curve at mean value
dens_at_mean = function(x) {
  d = density(x)
  mean.x = mean(x)
  data.frame(mean.x = mean.x,
             max.y = max(d$y),
             mean.y = approx(d$x, d$y, xout=mean.x)$y)
}
# Function to return data frame with properly scaled heights
# to plot mean points
get_mean_coords = function(data, value.var, group.var) {
  data %>%
    group_by({{group.var}}) %>%
    summarise(vals = list(dens_at_mean({{value.var}}))) %>%
    ungroup %>%
    unnest_wider(vals) %>%
    # Scale y-value to work properly with stat_halfeye
    mutate(mean.y = (mean.y/max(max.y) * 0.9 + 1:n())) %>%
    select(-max.y)
}
df %>%
ggplot()+
aes(x=value, y=group)+
stat_halfeye() +
geom_point(data=get_mean_coords(df, value, group),
aes(x=mean.x, y=mean.y),
color="red", size=2) +
theme_bw() +
scale_y_discrete(expand=c(0.08,0.05))
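The core of dens_at_mean can be tried on a single vector in base R. This is a minimal sketch with a hypothetical sample `x`. The factor 0.9 and the added group index are copied from the scaling in get_mean_coords above; whether they match the slab height drawn by a given ggdist version is an assumption worth checking against the plot.

```r
set.seed(394)

# Hypothetical single-group sample
x <- rnorm(1000, mean = 2, sd = 1.5)

# Kernel density estimate evaluated on a regular grid
d <- density(x)

# Height of the curve at the mean, by linear interpolation between grid points
height_at_mean <- approx(d$x, d$y, xout = mean(x))$y

# Scaling as in get_mean_coords: fraction of the tallest curve times 0.9,
# plus the group's position on the discrete axis (1 for a single group)
group_index <- 1
y_plot <- height_at_mean / max(d$y) * 0.9 + group_index

print(c(height_at_mean = height_at_mean, y_plot = y_plot))
```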
The grouping variable for creating a geom_violin() plot in ggplot2 is expected to be discrete for obvious reasons. However my discrete values are numbers, and I would like to show them on a continuous scale so that I can overlay a continuous function of those numbers on top of the violins. Toy example:
library(tidyverse)
df <- tibble(x = sample(c(1,2,5), size = 1000, replace = T),
y = rnorm(1000, mean = x))
ggplot(df) + geom_violin(aes(x=factor(x), y=y))
This works as you'd imagine: violins with their x axis values (equally spaced) labelled 1, 2, and 5, with their means at y=1,2,5 respectively. I want to overlay a continuous function such as y=x, passing through the means. Is that possible? Adding + scale_x_continuous() predictably gives Error: Discrete value supplied to continuous scale. A solution would presumably spread the violins horizontally by the numeric x values, i.e. three times the spacing between 2 and 5 as between 1 and 2, but that is not the only thing I'm trying to achieve - overlaying a continuous function is the key issue.
If this isn't possible, alternative visualisation suggestions are welcome. I know I could replace violins with a simple scatter plot to give a rough sense of density as a function of y for a given x.
The functionality to plot violin plots on a continuous scale is directly built into ggplot.
The key is to keep the original continuous variable (instead of transforming it into a factor variable) and specify how to group it within the aesthetic mapping of the geom_violin() object. The width of the groups is set by the width argument of cut_width() (the second argument, 1 in the code below) and can be adjusted to the data at hand.
library(tidyverse)
df <- tibble(x = sample(c(1,2,5), size = 1000, replace = T),
y = rnorm(1000, mean = x))
ggplot(df, aes(x=x, y=y)) +
geom_violin(aes(group = cut_width(x, 1)), scale = "width") +
geom_smooth(method = 'lm')
By using this approach, all geoms for continuous data and their varying functionalities can be combined with the violin plots, e.g. we could easily replace the line with a loess curve and add a scatter plot of the points.
ggplot(df, aes(x=x, y=y)) +
geom_violin(aes(group = cut_width(x, 1)), scale = "width") +
geom_smooth(method = 'loess') +
geom_point()
More examples can be found in the ggplot helpfile for violin plots.
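To check what grouping cut_width() produces for data like the toy example, the small sketch below applies it to a hypothetical vector with the same three distinct values. It assumes ggplot2 is installed; `groups` is just an illustrative name.

```r
library(ggplot2)

# Hypothetical x values with the same three distinct levels as the toy data
x <- c(1, 2, 5, 1, 2, 5, 5)

# Intervals of width 1; each distinct x value falls into its own interval
groups <- cut_width(x, width = 1)

print(table(groups))
```

Intervals that contain no observations show a count of zero, so only three violins are drawn, each placed at its numeric x position.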
Try this. As you already guessed, spreading the violins by numeric values is the key to the solution. To this end I expand the df to include all x values in the interval min(x) to max(x) and use scale_x_discrete(drop = FALSE) so that all values are displayed.
Note: Thanks @ChrisW for the more general example of my approach.
library(tidyverse)
set.seed(42)
df <- tibble(x = sample(c(1,2,5), size = 1000, replace = T), y = rnorm(1000, mean = x^2))
# y = x^2
# add missing x values
x.range <- seq(from=min(df$x), to=max(df$x))
df <- df %>% right_join(tibble(x = x.range))
#> Joining, by = "x"
# Whatever the desired continuous function is:
df.fit <- tibble(x = x.range, y=x^2) %>%
mutate(x = factor(x))
ggplot() +
geom_violin(data=df, aes(x = factor(x, levels = 1:5), y=y)) +
geom_line(data=df.fit, aes(x, y, group=1), color = "red") +
scale_x_discrete(drop = FALSE)
#> Warning: Removed 2 rows containing non-finite values (stat_ydensity).
Created on 2020-06-11 by the reprex package (v0.3.0)
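The role of the explicit factor levels in this approach can be seen in isolation with base R. The sketch below uses a hypothetical vector `x`. Values 3 and 4 never occur, yet they are kept as empty levels, and these are what scale_x_discrete(drop = FALSE) then keeps on the axis.

```r
# Hypothetical observed values; 3 and 4 never occur
x <- c(1, 2, 5, 5, 2, 1, 1)

# Without explicit levels, only observed values become levels
f_observed <- factor(x)

# With explicit levels, the gaps at 3 and 4 are kept as empty levels
f_full <- factor(x, levels = seq(min(x), max(x)))

print(levels(f_observed))
print(table(f_full))
```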
I need to overlay normal density curves on 3 histograms sharing the same y-axis. The curves need to be separate for each histogram.
My dataframe (example):
height <- seq(140,189, length.out = 50)
weight <- seq(67,86, length.out = 50)
fev <- seq(71,91, length.out = 50)
df <- as.data.frame(cbind(height, weight, fev))
I created the histograms for the data as:
library(ggplot2)
library(tidyr)
df %>%
gather(key=Type, value=Value) %>%
ggplot(aes(x=Value,fill=Type)) +
geom_histogram(binwidth = 8, position="dodge")
I am now stuck at how to overlay normal density curves for the 3 variables (separate curve for each histogram) on the histograms that I have generated. I won't mind the final figure showing either count or density on the y-axis.
Any thoughts on how to proceed from here?
Thanks in advance.
I believe that the code in the question is almost right; the code below just uses the answer in the link provided by @akrun.
Note that I have commented out the call to facet_wrap by placing a comment char before the last plus sign.
library(ggplot2)
library(tidyr)
df %>%
gather(key = Type, value = Value) %>%
ggplot(aes(x = Value, color = Type, fill = Type)) +
geom_histogram(aes(y = ..density..),
binwidth = 8, position = "dodge") +
geom_density(alpha = 0.25) #+
facet_wrap(~ Type)
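Note that geom_density draws a kernel density estimate, not a fitted normal curve. If a normal curve per variable is wanted, one possible approach (a sketch, not the answer's method) is to evaluate dnorm() with each variable's own mean and standard deviation. The helper `normal_curve` and the data frame `curves` are hypothetical names introduced here.

```r
# Example data from the question
height <- seq(140, 189, length.out = 50)
weight <- seq(67, 86, length.out = 50)
fev    <- seq(71, 91, length.out = 50)
df <- data.frame(height, weight, fev)

binwidth <- 8  # same bin width as the histogram in the question

# Normal curve for one variable, on the density scale and on the count scale
normal_curve <- function(v, n_grid = 200) {
  grid <- seq(min(v), max(v), length.out = n_grid)
  dens <- dnorm(grid, mean = mean(v), sd = sd(v))
  data.frame(Value = grid,
             density = dens,
             count = dens * length(v) * binwidth)
}

# One curve per column, stacked in long format with a Type label
curves <- do.call(rbind, lapply(names(df), function(nm) {
  data.frame(Type = nm, normal_curve(df[[nm]]))
}))

print(head(curves))
```

The result could then be drawn on top of the histogram with something like geom_line(data = curves, aes(x = Value, y = density, color = Type)), using the `count` column instead when the y-axis shows counts.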
I want to overlay two density plots; one of data prior to transformation and one after. I don't care about the x and y values, only the shape of the curve.
I want to superimpose the 2 charts for a given Predictor on top of each other, even though the x-axis is different. I find it hard to look across the two facets. In reality, as well, there will be a lot more plots, so combining the non-transformed and transformed data into the one would be the best solution.
library(tidyverse)
require(caret)
data(BloodBrain)
bbbTrans <- preProcess(select(bbbDescr, adistd, adistm, dpsa3, inthb), method = "YeoJohnson")
bbbTransData <- predict(bbbTrans, select(bbbDescr, adistd, adistm, dpsa3, inthb))
dat <- bbbTransData %>%
gather(Predictor, Value) %>%
mutate(Transformation = "Yeo-Johnson") %>%
bind_rows(data.frame(gather(select(bbbDescr, adistd, adistm, dpsa3, inthb), Predictor, Value), Transformation = "NA", stringsAsFactors = FALSE))
# For the predictor adistd, I would like the x-axis range to be 0:12.5 for the
# "Yeo-Johnson" transformation and 0:250 for no transformation. In this plot, it
# is hard to see the shape of the transformed variables due to the different x-value range.
dat %>% ggplot(aes(x = Value, color = Transformation)) +
geom_density(aes(y = ..scaled..), position = "dodge") +
facet_wrap(~Predictor, scales = "free")
# i.e., I want to superimpose the 2 charts for a given Predictor on top of each other, even though the x-axis is different
# I find it hard to look across the two facets. In reality, as well, there will be a lot more plots, so combining the non-transformed and transformed data into the one plot using colour would be the best solution.
filter(dat, Transformation != 'NA') %>% ggplot(aes(x = Value, y = ..scaled..)) +
geom_density() +
facet_wrap(~Predictor, scales = "free")
filter(dat, Transformation == 'NA') %>% ggplot(aes(x = Value, y = ..scaled..)) +
geom_density() +
facet_wrap(~Predictor, scales = "free")
Edit: The algorithm I think I need is (and prefer to do using tidyverse):
Group by predictor/transformation
Get density for each
Transform x of density to (x-xmin)/(xmax-xmin) so that between 0 to 1
Plot transformed density$x, density$y
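The steps above can be sketched in base R as follows (not tidyverse, to keep the example dependency-free). The data frame `toy` is a hypothetical stand-in for `dat`, and `rescaled_density` is an illustrative helper. In addition to mapping x onto 0 to 1, it divides y by its maximum, mirroring `..scaled..`.

```r
set.seed(1)

# Hypothetical long-format data standing in for `dat`
toy <- data.frame(
  Predictor = "p1",
  Transformation = rep(c("none", "transformed"), each = 200),
  Value = c(rexp(200, rate = 0.02), log1p(rexp(200, rate = 0.02)))
)

# Density of one group with x mapped onto [0, 1] and y given a peak of 1
rescaled_density <- function(v, n = 512) {
  d <- density(v, n = n)
  data.frame(
    x = (d$x - min(d$x)) / (max(d$x) - min(d$x)),
    y = d$y / max(d$y)
  )
}

# Group by predictor and transformation, then compute each rescaled density
pieces <- split(toy, list(toy$Predictor, toy$Transformation), drop = TRUE)
res <- do.call(rbind, lapply(pieces, function(g) {
  data.frame(Predictor = g$Predictor[1],
             Transformation = g$Transformation[1],
             rescaled_density(g$Value))
}))

print(head(res))
```

Plotting `x` against `y` with color mapped to `Transformation` then overlays the two shapes on a common 0 to 1 axis.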
This solution scales the values (base::scale) and then computes the density (stats::density). The density function returns the same number of equally spaced points for every group, so we can lay them out from 0 to 1 (as the OP wants).
# How many points we want
nPoints <- 1e3
# Final result
res <- list()
# Using simple loop to scale and calculate density
combinations <- expand.grid(unique(dat$Predictor), unique(dat$Transformation))
for(i in 1:nrow(combinations)) {
  # Subset data
  foo <- subset(dat, Predictor == combinations$Var1[i] & Transformation == combinations$Var2[i])
  # Perform density on scaled signal
  densRes <- density(x = scale(foo$Value), n = nPoints)
  # Position signal from 1 to wanted number of points
  res[[i]] <- data.frame(x = 1:nPoints, y = densRes$y,
                         pred = combinations$Var1[i], trans = combinations$Var2[i])
}
res <- do.call(rbind, res)
ggplot(res, aes(x / nPoints, y, color = trans, linetype = trans)) +
geom_line(alpha = 0.5, size = 1) +
facet_wrap(~ pred, scales = "free")