Plotting the area under the curve of various distributions in R


Suppose I'm trying to find the area below a certain value for a Student's t distribution. For example, I calculate my t test statistic to be t=1.78 with 23 degrees of freedom. I know how to get the area under the curve above t=1.78 with the pt() function. How can I get a plot of the Student's t distribution with 23 degrees of freedom and the area under the curve above 1.78 shaded in? That is, I want the curve for pt(1.78,23,lower.tail=FALSE) plotted with the appropriate area shaded. Is there a way to do this?

ggplot version:
library(ggplot2)
ggplot(data.frame(x = c(-4, 4)), aes(x)) +
  stat_function(fun = dt, args = list(df = 23)) +  # full density curve
  stat_function(fun = dt, args = list(df = 23),
                xlim = c(1.78, 4),                 # shade only the upper tail
                geom = "area")

This should work:
x_coord <- seq(-5, 5, length.out = 200) # x-coordinates
plot(x_coord, dt(x_coord, 23), type = "l",
xlab = expression(italic(t)), ylab = "Density", bty = "l") # plot PDF
polygon(c(1.78, seq(1.78, 5, by = .3), 5, 5), # polygon for area under curve
c(0, dt(c(seq(1.78, 5, by = .3), 5), 23), 0),
col = "red", border = NA)
Regarding the arguments to polygon():
Your first and last points should be [1.78, 0] and [5, 0] (5 only because the plot extends to 5). These define the bottom edge of the red polygon.
The second and penultimate points are [1.78, dt(1.78, 23)] and [5, dt(5, 23)]. These define the end points of the upper edge.
The points in between are just the X and Y coordinates of an arbitrary number of points along the curve [x, dt(x, 23)]. The more points, the smoother the polygon.
Hope this helps
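The point-by-point recipe above can be wrapped in a small helper. This is a minimal sketch rather than part of the original answer: the function name tail_polygon and its arguments are hypothetical, and a finer grid than by = .3 is used so the upper edge looks smooth. The shaded region corresponds to pt(1.78, 23, lower.tail = FALSE).

```r
# Hypothetical helper: polygon vertices for the area under dt() on [lower, upper]
tail_polygon <- function(lower, upper, df, n = 200) {
  x <- seq(lower, upper, length.out = n)   # points along the curve
  data.frame(x = c(lower, x, upper),       # first and last vertex sit on the x-axis
             y = c(0, dt(x, df), 0))
}

poly <- tail_polygon(1.78, 5, df = 23)

curve(dt(x, 23), from = -5, to = 5,
      xlab = expression(italic(t)), ylab = "Density", bty = "l")
polygon(poly$x, poly$y, col = "red", border = NA)

# Probability that the shaded region represents (upper tail beyond 1.78)
p_upper <- pt(1.78, 23, lower.tail = FALSE)
```

Building the vertices in a data frame keeps the x and y vectors the same length by construction, which is the usual source of errors when calling polygon() by hand.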

Related

Density plot of the F-distribution (df1=1). Theoretical or simulated?

I am plotting the density of F(1,49) in R. The simulated plot does not seem to match the theoretical plot as values approach zero.
set.seed(123)
val <- rf(1000, df1=1, df2=49)
plot(density(val), yaxt="n",ylab="",xlab="Observation",
main=expression(paste("Density plot (",italic(n),"=1000, ",italic(df)[1],"=1, ",italic(df)[2],"=49)")),
lwd=2)
curve(df(x, df1=1, df2=49), from=0, to=10, add=T, col="red",lwd=2,lty=2)
legend("topright",c("Theoretical","Simulated"),
col=c("red","black"),lty=c(2,1),bty="n")

Using density(val, from = 0) gets you much closer, although still not perfect. Densities near boundaries are notoriously difficult to calculate in a satisfactory way.

By default, density uses a Gaussian kernel to estimate the probability density at a given point. Effectively, this means that at each point an observation was found, a normal density curve is placed there with its center at the observation. All these normal densities are added up, then the result is normalized so that the area under the curve is 1.
This works well if observations have a central tendency, but gives unrealistic results when there are sharp boundaries (Try plot(density(runif(1000))) for a prime example).
When you have a very high density of points close to zero, but none below zero, the left tails of the normal kernels "spill over" into negative values, giving a Gaussian-type shape near the boundary which doesn't match the theoretical density.
This means that if you have a sharp boundary at 0, you should remove the part of your simulated density that lies between zero and about two standard deviations of your smoothing kernel - anything below this will be misleading.
Since we can control the standard deviation of our smoothing kernel with the bw parameter of density, and easily control which x values are plotted using ggplot, we will get a more sensible result by doing something like this:
library(ggplot2)
# bw is passed to density() so that it sets the kernel's standard deviation
ggplot(with(density(val, bw = 0.1), data.frame(x, y)), aes(x, y)) +
geom_line(aes(col = "Simulated"), na.rm = TRUE) +
geom_function(fun = ~ df(.x, df1 = 1, df2 = 49),
aes(col = "Theoretical"), lty = 2) +
lims(x = c(0.2, 12)) +
theme_classic(base_size = 16) +
labs(title = expression(paste("Density plot (",italic(n),"=1000, ",
italic(df)[1],"=1, ",italic(df)[2],"=49)")),
x = "Observation", y = "") +
scale_color_manual(values = c("black", "red"), name = "")
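To see the "spill over" concretely, the kernel estimate can be computed by hand. This is a rough sketch assuming a Gaussian kernel with the same bw = 0.1 as above; the names grid, kde_manual and spill are illustrative only. Because each kernel is a normal density centred on an observation, the share of estimated mass below zero is simply the average of pnorm(0, mean = val, sd = bw).

```r
set.seed(123)
val <- rf(1000, df1 = 1, df2 = 49)
bw  <- 0.1                                  # kernel standard deviation

# KDE by hand: average of normal densities centred on each observation
grid <- seq(-1, 3, by = 0.01)
kde_manual <- sapply(grid, function(g) mean(dnorm(g, mean = val, sd = bw)))

# Share of the estimated probability mass that falls below zero,
# where the true F density has none
spill <- mean(pnorm(0, mean = val, sd = bw))

plot(grid, kde_manual, type = "l", xlab = "Observation", ylab = "Density")
curve(df(x, df1 = 1, df2 = 49), from = 0.01, to = 3, add = TRUE,
      col = "red", lty = 2)
abline(v = 0, col = "gray60")               # the boundary at zero
```

Inspecting spill gives a quick sense of how much of the estimate is misplaced for a given bandwidth; a smaller bw reduces it at the cost of a noisier curve.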

The kde1d and logspline packages are not bad for such densities.
sims <- rf(1500, 1, 49)
library(kde1d)
kd <- kde1d(sims, bw = 1, xmin = 0)
plot(kd, col = "red", xlim = c(0, 2), ylim = c(0, 2))
curve(df(x, 1, 49), add = TRUE)
library(logspline)
fit <- logspline(sims, lbound = 0, knots = c(0, 0.5, 1, 1.5, 2))
plot(fit, col = "red", xlim = c(0, 2), ylim = c(0, 2))
curve(df(x, 1, 49), add = TRUE)

R - smoothScatter plot with curve and diagonal axis

I have a file from which I want to draw a smoothed scatter plot using R. The plot must show the dots, a diagonal axis, and a curve for which I have a formula. I can create the smoothScatter plot, but I am not able to add the diagonal and the curve. Any suggestions or help would be appreciated.
https://drive.google.com/file/d/1KknqYcRBCGm8Xrj1XKh3mE7rb7LK9iny/view?usp=sharing
what I tried
diagonal axis
df$P0+df$P2 =1
curve
p2 = (√df$P0 − 1)^2
df=read.table("scale_out",sep='\t', header=TRUE)
df = data.frame(df)
smoothScatter(df$P0,df$P2, cex=10)
what I got
what I want
Thank you

The data doesn't seem to contain what you think it contains.
For example, if we just do a straight plot of P0 and P2, we get this:
plot(df$P0, df$P2, pch = 18, cex = 0.5)
The desired plot that you show in your question suggests that this should be a scatter plot of noisy data, but it isn't. If we plot all the numeric variables in your data frame against each other, we get this:
The only plot here that looks like a scatter plot of the correct shape is P0 versus B.prop.
Assuming that this is what you want, you can create the desired plot like this:
smoothScatter(df$P0, df$B.prop, cex = 2, xlab = "P0", ylab = "P2")
curve((sqrt(x) - 1)^2, 0, 1, lty = 2, lwd = 5, col = "red", add = TRUE)
lines(0:1, 1:0, lty = 2, lwd = 5, col = "deepskyblue4")
legend(0.35, 1, c("Coordination", "Independence"),
col = c("deepskyblue4", "red"), bg = "#FFFFFFAA",
lty = 2, lwd = 5, box.col = "#FFFFFF00")

Vertical Histogram

I'd like to do a vertical histogram. Ideally I should be able to put multiple on a single plot per day.
If this could be combined with quantmod experimental chart_Series or some other library capable of drawing bars for a time series that would be great. Please see the attached screenshot. Ideally I could plot something like this.
Is there anything built in or existing libraries that can help with this?

I wrote something a year or so ago to do vertical histograms in base graphics. Here it is, with a usage example.
VerticalHist <- function(x, xscale = NULL, xwidth, hist,
fillCol = "gray80", lineCol = "gray40") {
## x (required) is the x position to draw the histogram
## xscale (optional) is the "height" of the tallest bar (horizontally),
## it has sensible default behavior
## xwidth (required) is the horizontal spacing between histograms
## hist (required) is an object of type "histogram"
## (or a list / df with $breaks and $density)
## fillCol and lineCol... exactly what you think.
binWidth <- hist$breaks[2] - hist$breaks[1]
if (is.null(xscale)) xscale <- xwidth * 0.90 / max(hist$density)
n <- length(hist$density)
x.l <- rep(x, n)
x.r <- x.l + hist$density * xscale
y.b <- hist$breaks[1:n]
y.t <- hist$breaks[2:(n + 1)]
rect(xleft = x.l, ybottom = y.b, xright = x.r, ytop = y.t,
col = fillCol, border = lineCol)
}
## Usage example
require(plyr) ## Just needed for the round_any() in this example
n <- 1000
numberOfHists <- 4
data <- data.frame(ReleaseDOY = rnorm(n, 110, 20),
bin = as.factor(rep(c(1, 2, 3, 4), n / 4)))
binWidth <- 1
binStarts <- c(1, 2, 3, 4)
binMids <- binStarts + binWidth / 2
axisCol <- "gray80"
## Data handling
DOYrange <- range(data$ReleaseDOY)
DOYrange <- c(round_any(DOYrange[1], 15, floor),
round_any(DOYrange[2], 15, ceiling))
## Get the histogram objects
histList <- with(data, tapply(ReleaseDOY, bin, hist, plot = FALSE,
breaks = seq(DOYrange[1], DOYrange[2], by = 5)))
DOYmean <- with(data, tapply(ReleaseDOY, bin, mean))
## Plotting
par(mar = c(5, 5, 1, 1) + .1)
plot(c(0, 5), DOYrange, type = "n",
ann = FALSE, axes = FALSE, xaxs = "i", yaxs = "i")
axis(1, cex.axis = 1.2, col = axisCol)
mtext(side = 1, outer = F, line = 3, "Length at tagging (mm)",
cex = 1.2)
axis(2, cex.axis = 1.2, las = 1, line = -.7, col = "white",
at = c(75, 107, 138, 169),
labels = c("March", "April", "May", "June"), tck = 0)
mtext(side = 2, outer = F, line = 3.5, "Date tagged", cex = 1.2)
box(bty = "L", col = axisCol)
## Gridlines
abline(h = c(60, 92, 123, 154, 184), col = "gray80")
## Index by name: the numeric position of $density is not stable across R versions
biggestDensity <- max(unlist(lapply(histList, function(h){max(h$density)})))
xscale <- binWidth * .9 / biggestDensity
## Plot the histograms
for (lengthBin in 1:numberOfHists) {
VerticalHist(binStarts[lengthBin], xscale = xscale,
xwidth = binWidth, histList[[lengthBin]])
}

Violin plots might be close enough to what you want. They are density plots that have been mirrored through one axis, like a hybrid of a boxplot and a density plot. (Much easier to understand by example than by description. :-) )
Here is a simple (somewhat ugly) example of the ggplot2 implementation of them:
library(ggplot2)
library(lubridate)
data(economics) #sample dataset
# calculate year to group by using lubridate's year function
economics$year<-year(economics$date)
# get a subset
subset<-economics[economics$year>2003&economics$year<2007,]
ggplot(subset,aes(x=date,y=unemploy))+
geom_line()+geom_violin(aes(group=year),alpha=0.5)
A prettier example would be:
ggplot(subset,aes(x=date,y=unemploy))+
geom_violin(aes(group=year,colour=year,fill=year),alpha=0.5,
kernel="rectangular")+ # passes to stat_density, makes violin rectangular
geom_line(size=1.5)+ # make the line (wider than normal)
xlab("Year")+ # label one axis
ylab("Unemployment")+ # label the other
theme_bw()+ # make white background on plot
theme(legend.position = "none") # suppress legend
To include ranges instead of or in addition to the line, you would use geom_linerange or geom_pointrange.

If you use grid graphics then you can create rotated viewports wherever you want them and plot to the rotated viewport. You just need a function that will plot using grid graphics into a specified viewport; I would suggest ggplot2 or possibly lattice for this.
In base graphics you could write your own function to plot the rotated histogram (modify the plot.histogram function or just write your own from scratch using rect or other tools). Then you can use the subplot function from the TeachingDemos package to place the plot wherever you want on a larger plot.
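For a single rotated histogram in base graphics, one shortcut is to compute the bins with hist(plot = FALSE) and draw them with barplot(horiz = TRUE). The following is a sketch on simulated data (the values are made up for illustration), not a replacement for a general function:

```r
set.seed(1)
x <- rnorm(500, mean = 110, sd = 20)      # simulated data for illustration

h <- hist(x, plot = FALSE)                # compute bins without drawing

# Draw the counts as horizontal bars; space = 0 makes the bars touch
barplot(h$counts, horiz = TRUE, space = 0,
        col = "gray80", border = "gray40",
        xlab = "Count", ylab = "Value")

# With space = 0 and unit bar width, bar i spans [i - 1, i] on the y-axis,
# so the break values can be placed at 0, 1, ..., length(h$counts)
axis(2, at = seq(0, length(h$counts)), labels = h$breaks, las = 1)
```

This only covers one histogram per plot; placing several side by side still needs rect() or subplot() as described above.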

How to plot a normal distribution by labeling specific parts of the x-axis?

I am using the following code to create a standard normal distribution in R:
x <- seq(-4, 4, length=200)
y <- dnorm(x, mean=0, sd=1)
plot(x, y, type="l", lwd=2)
I need the x-axis to be labeled at the mean and at points three standard deviations above and below the mean. How can I add these labels?

The easiest (but not general) way is to restrict the limits of the x axis. The +/- 1:3 sigma will be labeled as such, and the mean will be labeled as 0 - indicating 0 deviations from the mean.
plot(x,y, type = "l", lwd = 2, xlim = c(-3.5,3.5))
Another option is to use more specific labels:
plot(x,y, type = "l", lwd = 2, axes = FALSE, xlab = "", ylab = "")
axis(1, at = -3:3, labels = c("-3s", "-2s", "-1s", "mean", "1s", "2s", "3s"))
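The same idea extends to a normal distribution with an arbitrary mean and standard deviation: compute the tick positions as the mean plus multiples of the standard deviation. The values mu = 100 and sigma = 15 below are purely illustrative assumptions.

```r
mu    <- 100   # hypothetical mean
sigma <- 15    # hypothetical standard deviation

x <- seq(mu - 4 * sigma, mu + 4 * sigma, length.out = 200)
y <- dnorm(x, mean = mu, sd = sigma)

# Suppress the default x-axis, then label the mean and +/- 1 to 3 sd
plot(x, y, type = "l", lwd = 2, xaxt = "n", xlab = "", ylab = "Density")
ticks <- mu + (-3:3) * sigma
axis(1, at = ticks, labels = ticks)
```

Using xaxt = "n" instead of axes = FALSE keeps the y-axis and the box, so only the x-axis is customised.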

Using the code in this answer, you could skip creating x and just use curve() on the dnorm function:
curve(dnorm, -3.5, 3.5, lwd=2, axes = FALSE, xlab = "", ylab = "")
axis(1, at = -3:3, labels = c("-3s", "-2s", "-1s", "mean", "1s", "2s", "3s"))
But this doesn't use the given code anymore.

If you like the hard way of doing things without using R's built-in function, or you want to do this outside R, you can use the following formula.
x<-seq(-4,4,length=200)
s = 1
mu = 0
y <- (1/(s * sqrt(2*pi))) * exp(-((x-mu)^2)/(2*s^2))
plot(x,y, type="l", lwd=2, col = "blue", xlim = c(-3.5,3.5))
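As a quick sanity check, the hand-written formula can be compared against R's built-in dnorm(). This is a small sketch; y_manual and matches are names introduced here for illustration.

```r
x  <- seq(-4, 4, length.out = 200)
s  <- 1
mu <- 0

# Normal density written out by hand
y_manual <- (1 / (s * sqrt(2 * pi))) * exp(-((x - mu)^2) / (2 * s^2))

# TRUE if the two agree up to numerical tolerance
matches <- isTRUE(all.equal(y_manual, dnorm(x, mean = mu, sd = s)))
```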

An extremely inefficient and unusual, but beautiful solution, which works based on the ideas of Monte Carlo simulation, is this:
simulate many draws (or samples) from a given distribution (say the normal) using rnorm. The rnorm function takes arguments (A,B,C) and returns a vector of A samples from a normal distribution centered at B, with standard deviation C.
plot the density of these draws using density.
Thus to take a sample of size 50,000 from a standard normal (i.e, a normal with mean 0 and standard deviation 1), and plot its density, we do the following:
x = rnorm(50000,0,1)
plot(density(x))
As the number of draws goes to infinity, the estimated density converges to the normal density. To illustrate this, see the image below, which shows from left to right and top to bottom 5000, 50000, 500000, and 5 million samples.
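The convergence can also be checked numerically by measuring how far the estimated density is from dnorm() as the sample size grows. This is a sketch with an arbitrary seed and an arbitrary comparison range of -3 to 3; sizes and max_err are illustrative names.

```r
set.seed(42)
sizes <- c(5000, 50000, 500000)

# Largest absolute gap between the kernel estimate and the true density
max_err <- sapply(sizes, function(n) {
  d <- density(rnorm(n, 0, 1), from = -3, to = 3)
  max(abs(d$y - dnorm(d$x)))
})

names(max_err) <- sizes
max_err   # the gap is expected to shrink as the sample size increases
```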

In the general case, for example Normal(2, 1):
f <- function(x) dnorm(x, 2, 1)
plot(f, -1, 5)
This is very general: f can be defined freely, with any given parameters, for example:
f <- function(x) dbeta(x, 0.1, 0.1)
plot(f, 0, 1)

I particularly like lattice for this. It makes it easy to add graphical information such as specific areas under a curve, which you usually need when dealing with probability problems such as finding P(a < X < b).
Please have a look:
library(lattice)
e4a <- seq(-4, 4, length = 10000) # Data to set up our normal
e4b <- dnorm(e4a, 0, 1)
xyplot(e4b ~ e4a, # Lattice xyplot
type = "l",
main = "Plot 2",
panel = function(x,y, ...){
panel.xyplot(x,y, ...)
panel.abline( v = c(0, 1, 1.5), lty = 2) #set z and lines
xx <- c(1, x[x>=1 & x<=1.5], 1.5) #Color area
yy <- c(0, y[x>=1 & x<=1.5], 0)
panel.polygon(xx,yy, ..., col='red')
})
In this example I make the area between z = 1 and z = 1.5 stand out. You can easily change these parameters according to your problem.
Axis labels are automatic.
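The probability that the shaded region stands for, P(1 < Z < 1.5), can be computed alongside the plot. A minimal sketch, with p_shaded and p_numeric as illustrative names:

```r
# Probability of the shaded region from the normal CDF
p_shaded <- pnorm(1.5) - pnorm(1)

# Cross-check by numerically integrating the density over the same interval
p_numeric <- integrate(dnorm, lower = 1, upper = 1.5)$value
```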

This is how to write it as a function:
normalCriticalTest <- function(mu, s) {
  x <- seq(-4, 4, length = 200) # x extends from -4 to 4
  # y follows the formula of the normal distribution: f(Y)
  y <- (1/(s * sqrt(2*pi))) * exp(-((x-mu)^2)/(2*s^2))
  plot(x, y, type = "l", lwd = 2, xlim = c(-3.5, 3.5))
  # vertical lines at -1.96 and 1.96; for mu = 0, s = 1 each tail beyond them holds 2.5% of the area
  abline(v = c(-1.96, 1.96), col = "red")
}
normalCriticalTest(0, 1) # draw a standard normal distribution with vertical lines
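The cutoffs of -1.96 and 1.96 are the 2.5% and 97.5% quantiles of the standard normal, so they can be obtained from qnorm() rather than typed in. The helper critical_values below is a hypothetical generalisation to other means and standard deviations, not part of the function above.

```r
# Two-sided critical values at the 5% level for a standard normal
crit_std <- qnorm(c(0.025, 0.975))

# Hypothetical generalisation: cutoffs for any mean and standard deviation
critical_values <- function(mu, s, alpha = 0.05) {
  qnorm(c(alpha / 2, 1 - alpha / 2), mean = mu, sd = s)
}

critical_values(10, 2)   # equals 10 + crit_std * 2
```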
Final result:

How to plot a ROC curve in R with only known SN/PPV/cutoff info

Given such data:
#Cutpoint SN (1-PPV)
5 0.56 0.01
7 0.78 0.19
9 0.91 0.58
How can I plot a ROC curve with R that produces a result similar to the attached one?
I know the ROCR package, but it doesn't take such input.

If you just want to create the plot (without that silly interpolation spline between points) then just plot the data you give in the standard way, prepending a point at (0,0) and appending one at (1,1) to give the end points of the curve.
## your data with different labels
dat <- data.frame(cutpoint = c(5, 7, 9),
TPR = c(0.56, 0.78, 0.91),
FPR = c(0.01, 0.19, 0.58))
## plot version 1
op <- par(xaxs = "i", yaxs = "i")
plot(TPR ~ FPR, data = dat, xlim = c(0,1), ylim = c(0,1), type = "n")
with(dat, lines(c(0, FPR, 1), c(0, TPR, 1), type = "o", pch = 25, bg = "black"))
text(TPR ~ FPR, data = dat, pos = 3, labels = dat$cutpoint)
abline(0, 1)
par(op)
To explain the code: The first plot() call sets up the plotting region, without doing any plotting at all. Note that I force the plot to cover the range (0,1) on both axes. The par() call tells R to plot axes that cover exactly that range - the default extends them by 4 percent of the range at each end.
The next line, with(dat, lines(....)), draws the ROC curve, and here we prepend and append the points at (0,0) and (1,1) to give the full curve. I use type = "o" to give both points and lines overplotted; the points are drawn with plotting character 25, which can be filled with a colour, here black.
Then I add labels to the points using text(....); the pos argument is used to position the label away from the actual plotting coordinates. I take the labels from the cutpoint column in the data frame.
The abline() call draws the 1:1 line (here the 0 and 1 mean an intercept of 0 and a slope of 1, respectively).
The final line resets the plotting parameters to the defaults we saved in op prior to plotting (in the first line).
The resulting plot looks like this:
It isn't an exact facsimile, and I prefer the plot using the default for the axis ranges (adding 4 percent):
plot(TPR ~ FPR, data = dat, xlim = c(0,1), ylim = c(0,1), type = "n")
with(dat, lines(c(0, FPR, 1), c(0, TPR, 1), type = "o", pch = 25, bg = "black"))
text(TPR ~ FPR, data = dat, pos = 3, labels = dat$cutpoint)
abline(0, 1)
Again, not a true facsimile but close.
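If a single summary number is also wanted, the area under the piecewise-linear curve drawn above can be approximated with the trapezoidal rule. This is a sketch that, like the plot, treats the (1-PPV) column as the false positive rate and adds the end points (0,0) and (1,1); with only three cutpoints the result is a coarse estimate.

```r
dat <- data.frame(cutpoint = c(5, 7, 9),
                  TPR = c(0.56, 0.78, 0.91),
                  FPR = c(0.01, 0.19, 0.58))

# Same end points as in the plot: (0, 0) and (1, 1)
fpr <- c(0, dat$FPR, 1)
tpr <- c(0, dat$TPR, 1)

# Trapezoidal rule: width of each segment times its average height
auc <- sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)
```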