Histogram y axis scaled to 1 inside scatterplot with greater y-axis - r

Alternate title: how can I scale the y-axis for the histogram only, so it ranges from 0 to 1?
It's a horrible question title, so here is an example to demonstrate. The data here are set up so that the ranges are close to my real data's y-axis range... about 0 to 3.5.
library(ggplot2)
x <- runif(100) * 200
y <- runif(100) * 3
xy <- data.frame(x, y)
p <- ggplot(xy) + theme_bw()
p + geom_point(aes(x, y)) +
  geom_histogram(aes(x), alpha = 1/10)
I want the histogram 'y-range' to be scaled to a max of 1. The first part of this answer shows an example, saying:
You were close, but need to use (..density..)*binwidth rather than
..count../sum(..count..)
# Your data:
all <- data.frame(fill=rep(LETTERS[1:4],c(26,24,23,29)),
Events=c(1,1,3,1,1,6,2,1,1,2,1,1,1,1,5,1,2,2,1,1,1,1,2,1,2,1,2,3,1,3,2,5,1,1,1,2,1,1,1,1,1,1,1,1,1,4,3,3,5,3,1,2,2,3,3,9,8,1,1,2,2,1,2,39,43,194,129,186,1,2,7,4,1,12,3,2,3,8,20,5,1,4,9,51,12,7,6,7,7,9,17,18,8,7,6,10,27,11,21,89,47,1))
bw <- 20 # set the binwidth
# plot
p1 <- ggplot(all, aes(x = Events, fill = fill)) +
  geom_histogram(aes(y = (..density..) * bw), position = 'dodge', binwidth = bw)
p1
but it doesn't work for me, failing with an error about there being no variable 'bw':
bw <- 30
p <- ggplot(xy) + theme_bw()
p + geom_point(aes(x, y)) +
  geom_histogram(aes(x = x, y = ..density.. * bw), alpha = 1/10)
Error in eval(expr, envir, enclos) : object 'bw' not found
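One hedged workaround for that particular error is simply to inline the binwidth value rather than referencing the bw variable (a sketch; the literal 30 is just the value set above):
p + geom_point(aes(x, y)) +
  geom_histogram(aes(x = x, y = ..density.. * 30), alpha = 1/10, binwidth = 30)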

Goodness me, I found the notation I needed...
y=..ncount..
From: Normalizing y-axis in histograms in R ggplot to proportion
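For completeness, a minimal sketch of the working version with the example data above (in ggplot2 3.3.0+, after_stat(ncount) is the preferred spelling of ..ncount..; the binwidth of 30 is just the value tried earlier):
p <- ggplot(xy) + theme_bw()
p + geom_point(aes(x, y)) +
  geom_histogram(aes(x = x, y = after_stat(ncount)), alpha = 1/10, binwidth = 30)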


ggplot set ylim min OR max

I am trying to make a plot where I can specify only the min or only the max value for the y axis, but I get: Error in if (zero_range(range)) { : missing value where TRUE/FALSE needed
From the documentation:
You can leave one value as NA to compute from the range of the data.
Thus, I did:
# Getting some data in
plot <- ggplot(mydata,
               aes_string(y = "yvar", x = "xvar", colour = "group", group = "group", fill = "group"))
# Adding some error bars
plot <- plot + geom_errorbar(aes(ymax = agg + var, ymin = agg - var), size = 0.5, colour = "black", data = mydata)
plot <- plot + geom_point(size = 4)
plot <- plot + geom_line(size = 1)
# Here is where I just want to set the y max
plot <- plot + coord_cartesian(ylim = c(NA, 100))
If I remove the ylim or change the NA to a numeric value, it works well. What am I missing here?
You can use expand_limits() to extend the axis in only one direction. For example:
ggplot(mtcars, aes(wt, mpg)) +
  geom_point() +
  expand_limits(y = c(NA, 50))
For your example, it would be:
plot + expand_limits(y=c(NA, 100))
You can even provide a single value. If that value is greater than the maximum of the data, it will expand the maximum. If lower than the minimum of the data, it will expand the minimum. In your example:
plot + expand_limits(y=100)
And here are two reproducible examples:
p = ggplot(mtcars, aes(wt, mpg)) +
geom_point()
p + expand_limits(y=-20)
p + expand_limits(y=200)
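If you actually need to cap the axis at 100 rather than only expand it, one hedged alternative is a one-sided scale limit; an NA in a scale limit is computed from the data range. Note that, unlike coord_cartesian(), scale limits drop observations outside the range (with a warning) rather than zooming:
# One-sided scale limit: NA keeps the lower bound computed from the data
plot + scale_y_continuous(limits = c(NA, 100))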

Let ggplot2 histogram show classwise percentages on y axis

library(ggplot2)
data = diamonds[, c('carat', 'color')]
data = data[data$color %in% c('D', 'E'), ]
I would like to compare the histogram of carat across color D and E, and use the classwise percentage on the y-axis. The solutions I have tried are as follows:
Solution 1:
ggplot(data = data, aes(carat, fill = color)) +
  geom_bar(aes(y = ..density..), position = 'dodge', binwidth = 0.5) +
  ylab("Percentage") + xlab("Carat")
This is not quite right since the y-axis shows the height of the estimated density.
Solution 2:
ggplot(data = data, aes(carat, fill = color)) +
  geom_histogram(aes(y = (..count..)/sum(..count..)), position = 'dodge', binwidth = 0.5) +
  ylab("Percentage") + xlab("Carat")
This is also not what I want, because the denominator used to calculate the ratio on the y-axis is the total count of D and E combined.
Is there a way to display the classwise percentages with ggplot2's stacked histogram? That is, instead of showing (# of obs in bin)/count(D+E) on y axis, I would like it to show (# of obs in bin)/count(D) and (# of obs in bin)/count(E) respectively for two color classes. Thanks.
Calculating from stats
You can scale them by group with the special stat variables group and count, using group to select subsets of count.
If you have ggplot 3.3.0 or newer, you can use the after_stat function to access these special variables:
ggplot(data, aes(carat, fill=color)) +
geom_histogram(
aes(y=after_stat(c(
count[group==1]/sum(count[group==1]),
count[group==2]/sum(count[group==2])
)*100)),
position='dodge',
binwidth=0.5
) +
ylab("Percentage") + xlab("Carat")
Using older versions of ggplot
In earlier versions this is more cumbersome: if you have at least 3.0, you can wrap the stat() function around each individual variable reference; in pre-3.0 versions you have to surround each name with two dots instead:
aes(y=c(
..count..[..group..==1]/sum(..count..[..group..==1]),
..count..[..group..==2]/sum(..count..[..group..==2])
)*100),
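Dropped into the same plot call as the 3.3.0+ example above, the pre-3.0 spelling would look like this (a sketch; same data and binwidth):
ggplot(data, aes(carat, fill = color)) +
  geom_histogram(
    aes(y = c(
      ..count..[..group.. == 1] / sum(..count..[..group.. == 1]),
      ..count..[..group.. == 2] / sum(..count..[..group.. == 2])
    ) * 100),
    position = 'dodge',
    binwidth = 0.5
  ) +
  ylab("Percentage") + xlab("Carat")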
Yeah but what are all these stats?
For more details on where these variables come from, the computed stats are documented alongside the stat function being used - for example, geom_histogram()'s default stat_bin() has this Computed variables section:
Computed variables:
count number of points in bin
density density of points in bin, scaled to integrate to 1
ncount count, scaled to maximum of 1
ndensity density, scaled to maximum of 1
width widths of bins
Beyond that, you can use ggplot_build() to inspect all the stats generated for any given plot:
> p = ggplot(data, [...etc...])
> ggplot_build(p)
$data
$data[[1]]
fill y count x xmin xmax density ncount
1 #440154FF 1.50553506 102 -0.125 -0.25 0.00 0.0301107011 0.0224323730
2 #440154FF 67.11439114 4547 0.375 0.25
[...snip...]
ndensity flipped_aes PANEL group ymin ymax colour size linetype
1 0.0224323730 FALSE 1 1 0 1.50553506 NA 0.5 1
2 1.0000000000 FALSE 1 1 0 67.11439114 NA 0.5 1
[...snip...]
It seems that binning the data outside of ggplot2 is the way to go. But I would still be interested to see if there is a way to do it with ggplot2.
library(dplyr)
breaks = seq(0,4,0.5)
data$carat_cut = cut(data$carat, breaks = breaks)
data_cut = data %>%
  group_by(color, carat_cut) %>%
  summarise(n = n()) %>%
  mutate(freq = n / sum(n))
ggplot(data = data_cut, aes(x = carat_cut, y = freq * 100, fill = color)) +
  geom_bar(stat = "identity", position = "dodge") +
  scale_x_discrete(labels = breaks) +
  ylab("Percentage") + xlab("Carat")
Fortunately, in my case, Rorschach's answer worked perfectly. I was here looking to avoid the solution proposed by Megan Halbrook, which is the one I was using until I realized it is not a correct solution.
Adding a density line to the histogram automatically changes the y axis to frequency density, not to percentage. The values of frequency density are equivalent to the proportions only when binwidth = 1.
As a quick definition found online: "To draw a histogram, first find the class width of each category. The area of the bar represents the frequency, so to find the height of the bar, divide frequency by the class width. This is called frequency density." https://www.bbc.co.uk/bitesize/guides/zc7sb82/revision/9
Below is an example, where the left panel shows percentage and the right panel shows density on the y axis.
library(ggplot2)
library(gridExtra)
TABLE <- data.frame(vari = c(0,1,1,2,3,3,3,4,4,4,5,5,6,7,7,8))
## selected binwidth
bw <- 2
## plot using count
plot_count <- ggplot(TABLE, aes(x = vari)) +
  geom_histogram(aes(y = ..count.. / sum(..count..) * 100), binwidth = bw, col = 1)
## plot using density
plot_density <- ggplot(TABLE, aes(x = vari)) +
  geom_histogram(aes(y = ..density..), binwidth = bw, col = 1)
## visualize together
grid.arrange(ncol = 2, grobs = list(plot_count, plot_density))
## visualize the values
data_count <- ggplot_build(plot_count)
data_density <- ggplot_build(plot_density)
## using ..count../sum(..count..) the values of the y axis are the same as
## density * binwidth * 100. This is because density shows the "frequency density".
data_count$data[[1]]$y == data_count$data[[1]]$density*bw * 100
## using ..density.. the values of the y axis are the "frequency densities".
data_density$data[[1]]$y == data_density$data[[1]]$density
## manually calculated percentage for each range of the histogram. Note that
## geom_histogram uses right-closed intervals.
min_range_of_intervals <- data_count$data[[1]]$xmin
for (i in min_range_of_intervals)
  cat(paste("Values >", i, "and <=", i + bw, "involve a percent of",
            sum(TABLE$vari > i & TABLE$vari <= (i + bw)) / nrow(TABLE) * 100), "\n")
# Values > -1 and <= 1 involve a percent of 18.75
# Values > 1 and <= 3 involve a percent of 25
# Values > 3 and <= 5 involve a percent of 31.25
# Values > 5 and <= 7 involve a percent of 18.75
# Values > 7 and <= 9 involve a percent of 6.25
When I tried Rorschach's answer it wasn't working for me, for reasons that weren't readily apparent, but I wanted to comment to say that if you are open to adding a density line to the histogram, doing that will automatically change the y axis to percent.
For example, I have a count of "doses" by a binary outcome (0, 1); this code produces a plain count histogram:
ggplot(data, aes(x=siadoses, fill=recallbin, color=recallbin)) +
geom_histogram(binwidth=1, alpha=.5, position='identity')
But when I include a density layer in my ggplot code and add y=..density.., I get a plot with percent on the y axis:
ggplot(data, aes(x=siadoses, fill=recallbin, color=recallbin)) +
geom_histogram(aes(y=..density..), binwidth=1, alpha=.5, position='identity') +
geom_density(alpha=.2)
It's kind of a workaround to your original question, but I thought I would share.

How to shade part of a density curve in ggplot (with no y axis data)

I'm trying to create a density curve in R using a set of 1000 random numbers, and shade the part that is less than or equal to a certain value. There are a lot of solutions out there involving geom_area or geom_ribbon, but they all require a yval, which I don't have (it's just a vector of 1000 numbers). Any ideas on how I could do this?
Two other related questions:
Is it possible to do the same thing for a cumulative density function (I'm currently using stat_ecdf to generate one), or shade it at all?
Is there any way to edit geom_vline so it will only go up to the height of the density curve, rather than the whole y axis?
Code: (the geom_area is a failed attempt to edit some code I found. If I set ymax manually, I just get a column taking up the whole plot, instead of just the area under the curve)
set.seed(100)
amount_spent <- rnorm(1000,500,150)
amount_spent1<- data.frame(amount_spent)
rand1 <- runif(1,0,1000)
amount_spent1$pdf <- dnorm(amount_spent1$amount_spent)
mean1 <- mean(amount_spent1$amount_spent)
#density/bell curve
ggplot(amount_spent1, aes(amount_spent)) +
  geom_density(size = 1.05, color = "gray64", alpha = .5, fill = "gray77") +
  geom_vline(xintercept = mean1, alpha = .7, linetype = "dashed", size = 1.1, color = "cadetblue4") +
  geom_vline(xintercept = rand1, alpha = .7, linetype = "dashed", size = 1.1, color = "red3") +
  geom_area(mapping = aes(ifelse(amount_spent1$amount_spent > rand1, amount_spent1$amount_spent, 0)),
            ymin = 0, ymax = .03, fill = "red", alpha = .3) +
  ylab("") +
  xlab("Amount spent on lobbying (in Millions USD)") +
  scale_x_continuous(breaks = seq(0, 1000, 100))
There are a couple of questions that show this ... here and here, but they calculate the density prior to plotting.
This is another way, more complicated than required I'm sure, that allows ggplot to do some of the calculations for you.
# Your data
set.seed(100)
amount_spent1 <- data.frame(amount_spent=rnorm(1000, 500, 150))
mean1 <- mean(amount_spent1$amount_spent)
rand1 <- runif(1,0,1000)
Basic density plot
p <- ggplot(amount_spent1, aes(amount_spent)) +
  geom_density(fill = "grey") +
  geom_vline(xintercept = mean1)
You can extract the x and y positions for the area to shade from the plot object using ggplot_build. Linear interpolation is used to get the y value at x = rand1.
# subset region and plot
d <- ggplot_build(p)$data[[1]]
p <- p + geom_area(data = subset(d, x > rand1), aes(x=x, y=y), fill="red") +
geom_segment(x=rand1, xend=rand1,
y=0, yend=approx(x = d$x, y = d$y, xout = rand1)$y,
colour="blue", size=3)

ggplot2 using geom_errorbar and geom_point to add points to a plot

I have a plot using ggplot, and I would like to add points and error bars to it. I am using geom_errorbar and geom_point, but I am getting an error: "Discrete value supplied to continuous scale", and I am not sure why. The data labels in the plot should remain the same; I simply want to add new points to the existing graph, so that there are two points/CI bars for each label on the y axis.
The following example is from the lme4 package, and it produces a plot with confidence intervals using ggplot (everything can be replicated except the last two lines of broken code). My data differ only in that they include about 15 intercepts instead of the 6 below (which is why I am using scale_shape_manual).
The last two lines of code are my attempt at adding points/confidence intervals. I'm going to put a 50 bounty on this. Please let me know if I am being unclear. Thanks!
library("lme4")
data(package = "lme4")
# Dyestuff
# a balanced one-way classification of Yield
# from samples produced from six Batches
summary(Dyestuff)
# Batch is an example of a random effect
# Fit 1-way random effects linear model
fit1 <- lmer(Yield ~ 1 + (1|Batch), Dyestuff)
summary(fit1)
coef(fit1) #intercept for each level in Batch
randoms<-ranef(fit1, postVar = TRUE)
qq <- attr(ranef(fit1, postVar = TRUE)[[1]], "postVar")
rand.interc<-randoms$Batch
#THESE ARE THE ADDITIONAL POINTS TO BE ADDED TO THE PLOT
Inter <- c(-25,-45,20,30,23,67)
SE2 <- c(20,20,20,20,20,20)
df<-data.frame(Intercepts=randoms$Batch[,1],
sd.interc=2*sqrt(qq[,,1:length(qq)]), Intercepts2=Inter, sd.iterc2=SE2,
lev.names=rownames(rand.interc))
df$lev.names<-factor(df$lev.names,levels=df$lev.names[order(df$Intercepts)])
library(ggplot2)
p <- ggplot(df,aes(lev.names,Intercepts,shape=lev.names))
#Added horizontal line at y=0
#Includes first set of points/confidence intervals. This works without error
p <- p + geom_hline(yintercept=0) +geom_errorbar(aes(ymin=Intercepts-sd.interc, ymax=Intercepts+sd.interc), width=0,color="black") + geom_point(aes(size=2))
#Removed legends and with scale_shape_manual point shapes set to 1 and 16
p <- p + guides(size=FALSE,shape=FALSE) + scale_shape_manual(values=c(16,16,16,16,16,16))
#Changed appearance of plot (black and white theme) and x and y axis labels
p <- p + theme_bw() + xlab("Levels") + ylab("")
#Final adjustments of plot
p <- p + theme(axis.text.x=element_text(size=rel(1.2)),
axis.title.x=element_text(size=rel(1.3)),
axis.text.y=element_text(size=rel(1.2)),
panel.grid.minor=element_blank(),
panel.grid.major.x=element_blank())
#To put levels on y axis you just need to use coord_flip()
p <- p+ coord_flip()
print(p)
#####
# code for adding more plots, NOT working yet
p <- p +geom_errorbar(aes(ymin=Intercepts2-sd.interc2, ymax=Intercepts2+sd.interc2),
width=0,color="gray40", lty=1, size=1)
p <- p + geom_point(aes(Intercepts2, lev.names),size=0,pch=7)
First, in your data frame df and in geom_errorbar() there are two different variable names, sd.iterc2 and sd.interc2. I changed it to sd.interc2 in df as well.
For the last line with geom_point() you get the error because your x and y values are in the wrong order. As you are using coord_flip(), the x and y values should be given in the same order as in the original plot before coord_flip(), that is, lev.names as x and Intercepts2 as y. I also changed size= to 5 for better illustration.
+ geom_point(aes(lev.names,Intercepts2),size=5,pch=7)
Update - adding legend
To add a legend for the two types of intercept points, one option is to reshape your data to long format and add a new column with the intercept type. Another option, with your existing data, is to first remove shape=lev.names from the ggplot() call. Then, in both geom_point() calls, add shape="somename" inside aes(). Finally, set the shape values you need with scale_shape_manual().
ggplot(df,aes(lev.names,Intercepts))+
geom_hline(yintercept=0) +
geom_errorbar(aes(ymin=Intercepts-sd.interc, ymax=Intercepts+sd.interc), width=0,color="black")+
geom_point(aes(shape="Intercepts"),size=5)+
theme_bw() + xlab("Levels") + ylab("")+
theme(axis.text.x=element_text(size=rel(1.2)),
axis.title.x=element_text(size=rel(1.3)),
axis.text.y=element_text(size=rel(1.2)),
panel.grid.minor=element_blank(),
panel.grid.major.x=element_blank())+
coord_flip()+
geom_errorbar(aes(ymin=Intercepts2-sd.interc2, ymax=Intercepts2+sd.interc2),
width=0,color="gray40", lty=1, size=1) +
geom_point(aes(lev.names,Intercepts2,shape="Intercepts2"),size=5)+
scale_shape_manual(values=c(16,7))

cumulative probability plot from frequency table

Is there any way to plot the cumulative probability from a frequency table? I mean a "smooth" version of it, similar to the way geom_density() plots.
So far, I managed to plot the individually calculated probabilities as points joined by lines, but it doesn't look very good.
I generate some test data:
set.seed(1)
x <- sort(sample(1:100, 20))
p <- runif(x); p <- cumsum(p)/sum(p)
table <- data.frame(x=x, prob=p)
You can use geom_smooth from the ggplot2 package.
require("ggplot2")
qplot(x=x, y=p, data=table, aes(ymin=0, ymax=1)) + ylab("ecf") +
geom_smooth(se=F, stat="smooth", method="loess", fullrange=T, fill="lightgrey", size=1)
As an alternative, an easy way to control the smoothing with a parameter is DeconCdf from the decon package:
require("decon")
plot(DeconCdf(x, sig=1))
If you want to use ggplot, you first have to transform the DeconCdf output into a data.frame.
f <- DeconCdf(x, sig=1)
m <- ggplot(data=data.frame(x=f$x, p=f$y), aes(x=x, y=p, ymin=0, ymax=1)) + ylab("ecf")
m + geom_line(size=1)
Use the sig parameter as your smoothing parameter:
f <- DeconCdf(x, sig=0.3)
m <- ggplot(data=data.frame(x=f$x, p=f$y), aes(x=x, y=p, ymin=0, ymax=1)) + ylab("ecf")
m + geom_line(size=1)
This version plots a histogram with a smoothed line from geom_density:
# Generate some data:
set.seed(28986)
x2 <- rweibull(100, 1, 1/2)
# Plot the points:
library(ggplot2)
library(scales)
ggplot(data.frame(x = x2), aes(x = x, y = 1 - cumsum(..count..)/sum(..count..))) +
  geom_histogram(aes(fill = ..count..)) +
  geom_density(fill = NA, color = "black", adjust = 1/2) +
  scale_y_continuous("Percent of units\n(equal to or larger than x)", labels = percent) +
  theme_grey(base_size = 18)
Note that I've used 1 - "cumulative probability" due to individual preference (I think it looks better and I'm accustomed to dealing with "reliability" metrics), but obviously that's just a preference that you could ignore by removing the 1- part in the aes.
