Plot Percentile Indication in R / ggplot2

I have a basic plot of a two column data frame (x = "Periods" and y = "Range").
library(ggplot2)
qplot(Periods, Range, data=twocoltest, color=Periods, size = 3) + geom_jitter(position=position_jitter(width=0.2))
I am trying to add a horizontal line at each period below which lie 90% of all the observations for that period. (It doesn't have to be a horizontal line; any visual indication per period would suffice.)
Any help would be greatly appreciated.

Alrighty, I've read the ggplot help, and here's a go:
# example data
twocoltest <- data.frame(Periods=rep(1:3,each=3),Range=1:9)
library(ggplot2)
p <- qplot(Periods, Range, data=twocoltest, color=Periods, size = 3) + geom_jitter(position=position_jitter(width=0.2))
# 90th percentile of the values passed in
q90 <- function(x) {quantile(x,probs=0.9)}
# ggplot2 >= 3.3.0 names this argument `fun`; older versions call it `fun.y`
p + stat_summary(fun=q90, colour="red", geom="crossbar", size = 1, ymin=0, ymax=0)
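An alternative sketch, in case computing the percentile outside the plot is easier to follow: summarise the 90th percentile per period with base R's aggregate() and draw it as a short horizontal segment. The names q90_by_period and p_q90 and the 0.3 half-width of the segments are assumptions made for illustration; the data is the example data from the answer above.

```r
library(ggplot2)

# Example data from the answer above
twocoltest <- data.frame(Periods = rep(1:3, each = 3), Range = 1:9)

# 90th percentile of Range within each period (one row per period)
q90_by_period <- aggregate(
  Range ~ Periods, data = twocoltest,
  FUN = function(x) unname(quantile(x, probs = 0.9))
)
names(q90_by_period)[2] <- "q90"

# Jittered points plus a short horizontal segment at each period's percentile
p_q90 <- ggplot(twocoltest, aes(Periods, Range)) +
  geom_jitter(aes(colour = factor(Periods)), width = 0.2, height = 0, size = 3) +
  geom_segment(
    data = q90_by_period,
    aes(x = Periods - 0.3, xend = Periods + 0.3, y = q90, yend = q90),
    colour = "red"
  )
```

Printing p_q90 draws the plot. Keeping the summary in its own data frame also makes it easy to inspect or reuse the percentile values.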

Related

Histograms in R with ggplot2 (axis ticks,breaks)

Hello everyone, I have some data with which I need to create a nice histogram.
First I used hist() to create a basic one and, after some research, found out that it uses Sturges' method to decide how many bins are needed. To make a more customizable and better-looking histogram, I tried ggplot2 and entered the number of bins manually. As you can see in the photos, the histograms are not the same: on the y-axis, hist() reaches a frequency of about 60, while the ggplot version surpasses that.
Additionally, I'm having a hard time getting ggplot to show proper ticks on the x-axis. I can't find any reference on how to modify the tick marks so that they align with the breaks without messing up the graph.
Any ideas or help would be really appreciated.
Photos:
https://prnt.sc/greVRNoGo67T
https://prnt.sc/bMl29-2Fr5BN
One way to solve the problem is to do some pre-processing and plot a bar plot.
The pre-processing is to bin the data with cut. This transforms the continuous variable Total_Volume into a categorical variable, but since it is done in a pipe, the transformation is temporary; the original data remains unchanged.
The breaks are equidistant and the labels are the rounded midpoints of the bins.
Note that the very small counts are plotted with bars that are too short to be visible, but they are there.
suppressPackageStartupMessages({
library(dplyr)
library(ggplot2)
})
brks <- seq(min(x) - 1, max(x) + 1, length.out = 30)
# bin midpoints: left edge of each bin plus half the bin width
labs <- round(head(brks, -1) + diff(brks)/2)
data.frame(Total_Volume = x) %>%
mutate(Total_Volume = cut(x, breaks = brks, labels = labs)) %>%
ggplot(aes(Total_Volume)) +
geom_bar(width = 1) +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
Created on 2022-10-02 with reprex v2.0.2
Data creation code.
set.seed(2022)
n <- 1e6
x <- rchisq(n, df = 1, ncp = 5.5e6)
i <- x > 5.5e6
x[i] <- rnorm(sum(i), mean = 5.5e6, sd = 1e4)
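On the mismatch between hist() and ggplot counts: the two functions place their bins differently by default, so the same data can give different bar heights. A minimal sketch, assuming it is acceptable to reuse the breaks that hist() computes, passes those breaks to geom_histogram() and to the x-axis scale so that bars and ticks line up. The vector x here is hypothetical stand-in data, not the asker's Total_Volume.

```r
library(ggplot2)

set.seed(2022)
x <- rnorm(500, mean = 100, sd = 15)  # hypothetical stand-in data

# Let hist() choose the breaks (Sturges' rule) without drawing anything
h <- hist(x, breaks = "Sturges", plot = FALSE)

# Reuse the same breaks for the bins and for the axis ticks
p_hist <- ggplot(data.frame(x = x), aes(x)) +
  geom_histogram(breaks = h$breaks, closed = "right", colour = "white") +
  scale_x_continuous(breaks = h$breaks)

# Counts computed by ggplot2 for each bin
gg_counts <- ggplot_build(p_hist)$data[[1]]$count
```

With identical breaks, the per-bin counts from ggplot2 can be compared directly against h$counts.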

Why does specifying fill in an Aesthetic mapping change the figure in the plot

When trying to highlight a part of a plot, I got an output I didn't expect.
This is the code I'm using to plot the density function of student grades from my dataset.
grades <- student_data$G3
q_aprox = function(x) return (qnorm(x, mean(grades), sd(grades)))
ggplot(student_data, aes(x = G3)) +
# -- IMPORTANT PART BEGIN -- #
geom_density(
color = 'steelblue',
alpha = 0.3,
position = 'stack'
) +
geom_density(
aes(fill = q_aprox(0.025) < G3 & G3 < q_aprox(0.975)),
alpha = 0.3,
position = 'stack'
) + theme_minimal()
# -- IMPORTANT PART END -- #
Unexpectedly, the plot I got from the first geom_density is different than the one I got from the second geom_density. I expected that, since the x and y mappings are left untouched, the plots would be the same.
Why doesn't this happen?
grades, or student_data$G3, is a numeric vector of size 395 with discrete values from 0 up to 20.
Here's the plot that's produced from the previous code
Output Plot - Not enough reputation to post images, sorry
The left tail on the second call is bigger than the one on the first. Also, the output in general seems to be "more spiked".
I recently watched part 1 of ggplot2's workshop on YouTube in preparation for this college assignment. That's more or less my knowledge level regarding ggplot2.
Specifying the fill aesthetic with a discrete variable triggers heuristics in ggplot2 that automatically group your data. The densities are calculated per group, and each density integrates to 1. Therefore, if you calculate densities for two groups of unequal sizes, each still integrates to 1, so the areas of the densities do not reflect the unequal group sizes.
Below is an example of two groups, wherein group A is 10x as large as group B and the groups have different means. You'll notice that if we don't group the data, the resulting density peaks at -1: the center/mean of group A. However, when we auto-group the data with the fill aesthetic, both densities peak at their own means, but the area of group B is as large as that of group A (it continues behind the blue/green density).
library(ggplot2)
library(patchwork)
df <- data.frame(
x = c(rnorm(1000, -1), rnorm(100, 1)),
group = rep(c("A", "B"), c(1000, 100))
)
g1 <- ggplot(df, aes(x)) +
geom_density()
g2 <- ggplot(df, aes(x, fill = group)) +
geom_density()
g1 | g2
If you want to retain proportions to the group sizes, you can use y = after_stat(count) to use the computed variable count, which is the density estimate (which integrates to 1) times the number of observations. You can read about computed variables in the documentation under the header "computed variables" in for example ?geom_density.
ggplot(df, aes(x, fill = group)) +
geom_density(aes(y = after_stat(count)))
Created on 2021-05-12 by the reprex package (v0.3.0)
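The claim that each group's density integrates to 1 regardless of group size can be checked numerically. The sketch below uses base R's density() rather than ggplot2's internals, so it only approximates what geom_density computes; trapz is a hypothetical helper for trapezoid-rule integration.

```r
set.seed(1)
a <- rnorm(1000, mean = -1)  # large group
b <- rnorm(100, mean = 1)    # small group

# Trapezoid-rule integral of y over x
trapz <- function(x, y) sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)

da <- density(a)
db <- density(b)

# Both areas are close to 1 even though the groups differ in size
area_a <- trapz(da$x, da$y)
area_b <- trapz(db$x, db$y)

# Scaling by the number of observations (what `count` does) restores proportions
scaled_a <- area_a * length(a)
scaled_b <- area_b * length(b)
```

Here scaled_a comes out roughly ten times scaled_b, which mirrors the effect of y = after_stat(count).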

How to change origin line position in ggplot bar graph?

Say I'm measuring 10 personality traits and I know the population baseline. I would like to create a chart for individual test-takers to show them their individual percentile ranking on each trait. Thus, the numbers go from 1 (percentile) to 99 (percentile). Given that a 50 is perfectly average, I'd like the graph to show bars going to the left or right from 50 as the origin line. In bar graphs in ggplot, it seems that the origin line defaults to 0. Is there a way to change the origin line to be at 50?
Here's some fake data and default graphing:
df <- data.frame(
names = LETTERS[1:10],
factor = round(rnorm(10, mean = 50, sd = 20), 1)
)
library(ggplot2)
ggplot(data = df, aes(x=names, y=factor)) +
geom_bar(stat="identity") +
coord_flip()
Picking up on @nongkrong's comment, here's some code that will do what I think you want while relabeling the ticks to match the original range and relabeling the axis to avoid showing the math:
library(ggplot2)
ggplot(data = df, aes(x=names, y=factor - 50)) +
geom_bar(stat="identity") +
scale_y_continuous(breaks=seq(-50,50,10), labels=seq(0,100,10)) + ylab("Percentile") +
coord_flip()
This post was really helpful for me - thanks @ulfelder and @nongkrong. However, I wanted to re-use the code on different data without having to manually adjust the tick labels to fit the new data. To do this in a way that retained ggplot's tick placement, I defined a tiny function and passed it to the labels argument:
fix.labels <- function(x){
x + 50
}
ggplot(data = df, aes(x=names, y=factor - 50)) +
geom_bar(stat="identity") +
scale_y_continuous(labels = fix.labels) + ylab("Percentile") +
coord_flip()
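A possible generalisation, sketched under the assumption that the baseline might not always be 50: keep the origin in one variable and build the label function from it, so the shift in aes() and the tick labels cannot drift apart. shift_labels and origin are hypothetical names introduced for this example.

```r
library(ggplot2)

# Hypothetical helper: returns a label function that undoes a shift by `origin`
shift_labels <- function(origin) {
  function(x) x + origin
}

origin <- 50  # percentile treated as the baseline

set.seed(1)
df <- data.frame(
  names = LETTERS[1:10],
  factor = round(rnorm(10, mean = 50, sd = 20), 1)
)

# Bars start at the origin; tick labels show the original percentile scale
p_origin <- ggplot(df, aes(x = names, y = factor - origin)) +
  geom_col() +
  scale_y_continuous(labels = shift_labels(origin)) +
  ylab("Percentile") +
  coord_flip()
```

Changing origin in one place then moves both the bars and the labels.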

Symmetrical histograms

I want to make a number of symmetrical histograms to show butterfly abundance through time. Here's a site that shows the form of the graphs I am trying to create: http://thebirdguide.com/pelagics/bar_chart.htm
For ease, I will use the iris dataset.
library(ggplot2)
g <- ggplot(iris, aes(Sepal.Width)) + geom_histogram(binwidth=.5)
g + coord_fixed(ratio = .003)
Essentially, I would like to mirror this histogram below the x-axis. Another way of thinking about the problem is to create a horizontal violin diagram with distinct bins. I've looked at the plotrix package and the ggplot2 documentation but don't find a solution in either place. I prefer to use ggplot2 but other solutions in base R, lattice or other packages will be fine.
Without your exact data, I can only provide an approximate coding solution, but it is a start for you (if you add more details, I'll be happy to help you tweak the plot). Here's the code:
library(ggplot2)
noSpp <- 3
nTime <- 10
d <- data.frame(
JulianDate = rep(1:nTime , times = noSpp),
sppAbundance = c(c(1:5, 5:1),
c(3:5, 5:1, 1:2),
c(5:1, 1:5)),
yDummy = 1,
sppName = rep(letters[1:noSpp], each = nTime))
ggplot(data = d, aes(x = JulianDate, y = yDummy, size = sppAbundance)) +
geom_line() + facet_grid( sppName ~ . ) + ylab("Species") +
xlab("Julian Date")
And here's the figure.
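For the literal request in the question, mirroring the iris histogram around the x-axis, one possible sketch is to bin the data with hist() and draw each bin as a rectangle centred on zero. This is an assumption about the intended look, not the approach of the answer above; bins and p_mirror are names made up for the example.

```r
library(ggplot2)

# Bin the data once with base R; the 0.5 bin width follows the question
h <- hist(iris$Sepal.Width, breaks = seq(2, 4.5, by = 0.5), plot = FALSE)
bins <- data.frame(
  xmin  = head(h$breaks, -1),
  xmax  = tail(h$breaks, -1),
  count = h$counts
)

# Each bar spans -count/2 to +count/2, so it is symmetric about y = 0
p_mirror <- ggplot(bins) +
  geom_rect(
    aes(xmin = xmin, xmax = xmax, ymin = -count / 2, ymax = count / 2),
    fill = "grey40", colour = "white"
  ) +
  labs(x = "Sepal.Width", y = "Count (mirrored)")
```

The total height of each bar still equals the bin count; the y-axis labels show half-counts on either side of zero and may need relabeling or hiding.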

How to shade part of a density curve in ggplot (with no y axis data)

I'm trying to create a density curve in R using a set of random numbers between 0 and 1000, and shade the part that is less than or equal to a certain value. There are a lot of solutions out there involving geom_area or geom_ribbon, but they all require a yval, which I don't have (it's just a vector of 1000 numbers). Any ideas on how I could do this?
Two other related questions:
Is it possible to do the same thing for a cumulative density function (I'm currently using stat_ecdf to generate one), or shade it at all?
Is there any way to edit geom_vline so it will only go up to the height of the density curve, rather than the whole y axis?
Code: (the geom_area is a failed attempt to edit some code I found. If I set ymax manually, I just get a column taking up the whole plot, instead of just the area under the curve)
set.seed(100)
amount_spent <- rnorm(1000,500,150)
amount_spent1<- data.frame(amount_spent)
rand1 <- runif(1,0,1000)
amount_spent1$pdf <- dnorm(amount_spent1$amount_spent)
mean1 <- mean(amount_spent1$amount_spent)
#density/bell curve
ggplot(amount_spent1,aes(amount_spent)) +
geom_density( size=1.05, color="gray64", alpha=.5, fill="gray77") +
geom_vline(xintercept=mean1, alpha=.7, linetype="dashed", size=1.1, color="cadetblue4")+
geom_vline(xintercept=rand1, alpha=.7, linetype="dashed",size=1.1, color="red3")+
geom_area(mapping=aes(ifelse(amount_spent1$amount_spent > rand1,amount_spent1$amount_spent,0)), ymin=0, ymax=.03,fill="red",alpha=.3)+
ylab("")+
xlab("Amount spent on lobbying (in Millions USD)")+
scale_x_continuous(breaks=seq(0,1000,100))
There are a couple of questions that show this ... here and here, but they calculate the density prior to plotting.
This is another way, more complicated than required I'm sure, that allows ggplot to do some of the calculations for you.
# Your data
set.seed(100)
amount_spent1 <- data.frame(amount_spent=rnorm(1000, 500, 150))
mean1 <- mean(amount_spent1$amount_spent)
rand1 <- runif(1,0,1000)
# Basic density plot
p <- ggplot(amount_spent1, aes(amount_spent)) +
geom_density(fill="grey") +
geom_vline(xintercept=mean1)
You can extract the x and y positions for the area to shade from the plot object using ggplot_build. Linear interpolation (approx) is used to get the y value at x = rand1.
# subset region and plot
d <- ggplot_build(p)$data[[1]]
p <- p + geom_area(data = subset(d, x > rand1), aes(x=x, y=y), fill="red") +
geom_segment(x=rand1, xend=rand1,
y=0, yend=approx(x = d$x, y = d$y, xout = rand1)$y,
colour="blue", size=3)
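For comparison, here is a sketch of the other route mentioned above, where the density is calculated before plotting. It uses base R's density(), whose defaults may differ slightly from geom_density, and a fixed cutoff of 400 as a hypothetical threshold in place of the random rand1. It shades the part at or below the cutoff, as the question asks.

```r
library(ggplot2)

set.seed(100)
amount_spent <- rnorm(1000, 500, 150)
cutoff <- 400  # hypothetical threshold; the question draws it at random

# Estimate the density up front so the y values exist as data
dens <- density(amount_spent)
dens_df <- data.frame(x = dens$x, y = dens$y)
shaded <- subset(dens_df, x <= cutoff)

# Full curve as a line, region at or below the cutoff as a filled area
p_shade <- ggplot(dens_df, aes(x, y)) +
  geom_line() +
  geom_area(data = shaded, fill = "red", alpha = 0.3) +
  labs(x = "Amount spent", y = "Density")

# Trapezoid-rule area of the shaded part, roughly P(amount_spent <= cutoff)
shaded_area <- sum(diff(shaded$x) * (head(shaded$y, -1) + tail(shaded$y, -1)) / 2)
```

Because the curve is ordinary data here, the shaded area can also be compared with the empirical share of observations at or below the cutoff.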

Resources