How to shade a graph using curve() in R

I am plotting the standard normal distribution.
curve(dnorm(x), from=-4, to=4,
main = "The Standard Normal Distribution",
ylab = "Probability Density",
xlab = "X")
For pedagogical reasons, I want to shade the area below a certain quantile of my choice. How can I do this?

If you want to use curve and base plot, then you can write a little function yourself with polygon:
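colorArea <- function(from, to, density, ..., col="blue", dens=NULL){
  # Evaluate the density function on a fine grid between the two limits
  y_seq <- seq(from, to, length.out=500)
  # Close the outline by dropping to zero at both ends
  d <- c(0, density(y_seq, ...), 0)
  polygon(c(from, y_seq, to), d, col=col, density=dens)
}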
A little example follows:
curve(dnorm(x), from=-4, to=4,
main = "The Standard Normal Distribution",
ylab = "Probability Density",
xlab = "X")
colorArea(from=-4, to=qnorm(0.025), dnorm)
colorArea(from=qnorm(0.975), to=4, dnorm, mean=0, sd=1, col=2, dens=20)
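Since the question asks about the area below a single quantile, here is a minimal sketch of that case with polygon called directly. The probability p = 0.10 is an arbitrary value chosen for illustration, and the integrate call is only a sanity check that the shaded area matches that probability.

```r
p <- 0.10        # assumed probability, chosen only for illustration
q <- qnorm(p)    # quantile below which to shade

curve(dnorm(x), from = -4, to = 4,
      main = "The Standard Normal Distribution",
      ylab = "Probability Density", xlab = "X")

# Outline of the region: along the curve from -4 to q, then back along the axis
xs <- seq(-4, q, length.out = 200)
polygon(c(xs, q, -4), c(dnorm(xs), 0, 0), col = "lightblue", border = NA)
abline(v = q, lty = 2)

# Sanity check: the area under the density below q should be close to p
shaded_area <- integrate(dnorm, lower = -Inf, upper = q)$value
```

Changing p moves the boundary, which may be handy when showing students how the quantile and the probability relate.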

We could use the following R code too, in order to shade the regions under the standard normal curve below a certain (given) quantile:
library(ggplot2)
z <- seq(-4,4,0.01)
fz <- dnorm(z)
q <- qnorm(0.1) # the quantile
x <- seq(-4, q, 0.01)
y <- c(dnorm(x), 0, 0)
x <- c(x, q, -4)
ggplot() + geom_line(aes(z, fz)) +
geom_polygon(data = data.frame(x=x, y=y), aes(x, y), fill='blue')
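A possible shortcut, assuming a reasonably recent ggplot2 (3.3.0 or later, where stat_function works without a data argument), is to let stat_function evaluate dnorm itself and restrict the shaded layer with its xlim argument. This is a sketch, not a replacement for the polygon approach above.

```r
library(ggplot2)

q <- qnorm(0.1)  # the quantile, as in the answer above

p <- ggplot() +
  xlim(-4, 4) +
  # Shaded layer: dnorm evaluated only between -4 and q, drawn as an area
  stat_function(fun = dnorm, geom = "area", xlim = c(-4, q),
                fill = "blue", alpha = 0.5) +
  # Full density curve drawn on top
  stat_function(fun = dnorm) +
  labs(x = "z", y = "Density")
print(p)
```

This avoids building the polygon vertices by hand, at the cost of relying on ggplot2 to sample the function.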

Related

GGplot second y axis without the transformation of y axis

Does anyone know how to reproduce this
set.seed(101)
x <- 1:10
y <- rnorm(10)
## second data set on a very different scale
z <- runif(10, min=1000, max=10000)
par(mar = c(5, 4, 4, 4) + 0.3) # Leave space for z axis
plot(x, y) # first plot
par(new = TRUE)
plot(x, z, type = "l", axes = FALSE, bty = "n", xlab = "", ylab = "")
axis(side=4, at = pretty(range(z)))
mtext("z", side=4, line=3)
but using ggplot?
In ggplot you can only create a second axis with sec_axis() or dup_axis(), both of which are defined as a transformation of the y axis. What about a whole new independent y axis that applies only to the z variable, while the primary y axis applies to the y variable?
ggplot2::sec_axis is the only mechanism ggplot2 provides for a second axis, and it took a lot of convincing to get that into the codebase. You are responsible for coming up with the transformation. It must be a one-to-one (monotonic) mapping of the primary axis, and a linear rescaling is the simplest case; if either axis needs to be non-linear (e.g., exponential or logarithmic), your math skills will be put to the test.
If you can use the scales package, then this process becomes trivial:
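library(ggplot2)
dat <- data.frame(x, y, z)
ggplot(dat, aes(x, y)) +
  geom_point() +
  geom_line(
    aes(y = zmod),
    # rescale z from its own range onto the range of y so it fits the primary axis
    data = ~ transform(., zmod = scales::rescale(z, range(y), range(z)))
  ) +
  scale_y_continuous(
    # the secondary axis applies the inverse rescaling, from range(y) back to range(z)
    sec.axis = sec_axis(~ scales::rescale(., range(dat$z), range(dat$y)),
                        breaks = c(2000,4000,6000,8000))
  )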
Unless I've missed something (I just checked ggplot2-3.3.5's NEWS.md file), this has not changed.
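For readers who prefer to see the transformation spelled out, the rescaling above is just a linear map between the two ranges. The following sketch writes it by hand; z_to_y and y_to_z are hypothetical helper names introduced here, not part of ggplot2 or scales.

```r
library(ggplot2)

set.seed(101)
x <- 1:10
y <- rnorm(10)
z <- runif(10, min = 1000, max = 10000)

# Linear map taking range(z) onto range(y): y_scale = a + b * z
b <- diff(range(y)) / diff(range(z))
a <- min(y) - b * min(z)
z_to_y <- function(z) a + b * z      # draws z in primary-axis units
y_to_z <- function(y) (y - a) / b    # inverse map, used to label the secondary axis

dat <- data.frame(x, y, z)
p <- ggplot(dat, aes(x, y)) +
  geom_point() +
  geom_line(aes(y = z_to_y(z))) +
  scale_y_continuous(sec.axis = sec_axis(~ y_to_z(.), name = "z"))
print(p)
```

The line is drawn in the units of y, and the secondary axis only relabels those positions with the inverse map, which is why the two functions must be exact inverses of each other.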

Plotting two or more lines on a graph using loops

I am trying to plot two or more lines on the same graph using a loop. My plot is a population dynamic in which I want to repeatedly change the value of the starting population but keep all other parameters the same. I want to plot the different outcomes on one graph. Can anyone help?
Try the following:
library(ggplot2)
MAX.Y<-30
# year<-0:30
year<-1:30
rlp<-0.1
lp<-rep(0,MAX.Y)
lp[1]<-4000
K<-4000000
for(n in 1: (MAX.Y-1)) {lp[n+1]<-lp[n]+(rlp)*(1-lp[n]/K)*lp[n]}
# plot(lp~year, xlab="Time (years)", ylab="Population size", main=c(paste("B) Anchovy population growth"), paste ("in less productive environment")), col="darkorchid", type="l", cex.main=1.0)
sp<-rep(0,MAX.Y)
sp[1]<-100000
for(n in 1: (MAX.Y-1)) {sp[n+1]<-sp[n]+(rlp)*(1-sp[n]/K)*sp[n]}
# lines(sp~year, type="l", col="black")
data = data.frame(year=year,lp=lp, sp=sp)
data = reshape2::melt(data, id.vars = 'year')
ggplot(data, aes(year, value, colour = variable))+
geom_line()+
labs(x = "Time (years)", y = "Population size",
title = "B) Anchovy population growth \n in less productive environment")+
theme_minimal()
Here is what I would do.
First, since the computations for lp and sp are the same and only the initial values differ, create a function to do the computation.
f <- function(initial, MAX, rlp, K){
x <- numeric(MAX)
x[1] <- initial
for(i in seq_len(MAX - 1)) {
x[i + 1] <- x[i] + rlp*(1 - x[i]/K)*x[i]
}
x
}
Now sapply the function to a vector of initial values.
MAX.Y <- 30
rlp <- 0.1
year <- seq_len(MAX.Y)
K <- 4000000
InitialValues <- setNames(c(4000, 100000), c("lp", "sp"))
x <- sapply(InitialValues, f, MAX.Y, rlp, K)
Then plot it with matlines. Because matlines only adds lines to an existing plot, first create an empty plot with the custom title, axis limits, etc.
plot(1, type = "n",
     xlim = range(year), ylim = range(x),
     main = c("B) Anchovy population growth", "in less productive environment"),
     xlab = "Time (years)",
     ylab = "Population size",
     cex.main = 1.0)
# The colours belong to the lines, so pass them to matlines, not to the empty plot
matlines(year, x, lty = "solid", col = c("darkorchid", "black"))
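Since the question asks specifically about a loop, here is a rough base-graphics sketch that draws one line per starting population with lines(). The function name grow, the third starting value and the colours are assumptions added for illustration; the recurrence is the same one used above.

```r
# Logistic growth from a given starting population (same recurrence as above)
grow <- function(initial, n_years, r, K) {
  pop <- numeric(n_years)
  pop[1] <- initial
  for (i in seq_len(n_years - 1)) {
    pop[i + 1] <- pop[i] + r * (1 - pop[i] / K) * pop[i]
  }
  pop
}

n_years <- 30
starts  <- c(4000, 100000, 1000000)   # third value is an assumed extra case
cols    <- c("darkorchid", "black", "steelblue")
runs    <- lapply(starts, grow, n_years = n_years, r = 0.1, K = 4000000)

# Empty plot sized to hold every trajectory, then one lines() call per run
plot(NA, xlim = c(1, n_years), ylim = range(unlist(runs)),
     xlab = "Time (years)", ylab = "Population size")
for (i in seq_along(runs)) {
  lines(seq_len(n_years), runs[[i]], col = cols[i])
}
legend("topleft", legend = format(starts, big.mark = ","), col = cols, lty = 1,
       title = "Starting population")
```

Adding another scenario then only means appending a value to starts and a colour to cols.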

Variation on "How to plot decision boundary of a k-nearest neighbor classifier from Elements of Statistical Learning?"

This is a question related to https://stats.stackexchange.com/questions/21572/how-to-plot-decision-boundary-of-a-k-nearest-neighbor-classifier-from-elements-o
For completeness, here's the original example from that link:
library(ElemStatLearn)
require(class)
x <- mixture.example$x
g <- mixture.example$y
xnew <- mixture.example$xnew
mod15 <- knn(x, xnew, g, k=15, prob=TRUE)
prob <- attr(mod15, "prob")
prob <- ifelse(mod15=="1", prob, 1-prob)
px1 <- mixture.example$px1
px2 <- mixture.example$px2
prob15 <- matrix(prob, length(px1), length(px2))
par(mar=rep(2,4))
contour(px1, px2, prob15, levels=0.5, labels="", xlab="", ylab="", main=
"15-nearest neighbour", axes=FALSE)
points(x, col=ifelse(g==1, "coral", "cornflowerblue"))
gd <- expand.grid(x=px1, y=px2)
points(gd, pch=".", cex=1.2, col=ifelse(prob15>0.5, "coral", "cornflowerblue"))
box()
I've been playing with that example, and would like to try to make it work with three classes. I can change some values of g with something like
g[8:16] <- 2
just to pretend that there are some samples which are from a third class. I can't make the plot work, though. I guess I need to change the lines that deal with the proportion of votes for winning class:
prob <- attr(mod15, "prob")
prob <- ifelse(mod15=="1", prob, 1-prob)
and also the levels on the contour:
contour(px1, px2, prob15, levels=0.5, labels="", xlab="", ylab="", main=
"15-nearest neighbour", axes=FALSE)
I am also not sure contour is the right tool for this. One alternative that works is to create a matrix of data that covers the region I'm interested in, classify each point of this matrix, and plot those with a large marker and different colors, similar to what is being done with the points(gd...) bit.
The final purpose is to be able to show different decision boundaries generated by different classifiers. Can someone point me in the right direction?
thanks
Rafael
Separating the code into its main parts helps outline how to achieve this:
Training data with 3 classes
train <- rbind(iris3[1:25,1:2,1],
iris3[1:25,1:2,2],
iris3[1:25,1:2,3])
cl <- factor(c(rep("s",25), rep("c",25), rep("v",25)))
Test data covering a grid
require(MASS)
test <- expand.grid(x=seq(min(train[,1]-1), max(train[,1]+1),
by=0.1),
y=seq(min(train[,2]-1), max(train[,2]+1),
by=0.1))
Classification for that grid
3 classes obviously
require(class)
classif <- knn(train, test, cl, k = 3, prob=TRUE)
prob <- attr(classif, "prob")
Data structure for plotting
require(dplyr)
dataf <- bind_rows(mutate(test,
prob=prob,
cls="c",
prob_cls=ifelse(classif==cls,
1, 0)),
mutate(test,
prob=prob,
cls="v",
prob_cls=ifelse(classif==cls,
1, 0)),
mutate(test,
prob=prob,
cls="s",
prob_cls=ifelse(classif==cls,
1, 0)))
Plot
require(ggplot2)
ggplot(dataf) +
geom_point(aes(x=x, y=y, col=cls),
data = mutate(test, cls=classif),
size=1.2) +
geom_contour(aes(x=x, y=y, z=prob_cls, group=cls, color=cls),
bins=2,
data=dataf) +
geom_point(aes(x=x, y=y, col=cls),
size=3,
data=data.frame(x=train[,1], y=train[,2], cls=cl))
We can also be a little fancier and plot the probability of class membership as an indication of the "confidence".
ggplot(dataf) +
geom_point(aes(x=x, y=y, col=cls, size=prob),
data = mutate(test, cls=classif)) +
scale_size(range=c(0.8, 2)) +
geom_contour(aes(x=x, y=y, z=prob_cls, group=cls, color=cls),
bins=2,
data=dataf) +
geom_point(aes(x=x, y=y, col=cls),
size=3,
data=data.frame(x=train[,1], y=train[,2], cls=cl)) +
geom_point(aes(x=x, y=y),
size=3, shape=1,
data=data.frame(x=train[,1], y=train[,2], cls=cl))
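To answer the doubt about whether contour can be used at all with three classes: a base-graphics sketch along the same lines draws one 0/1 indicator surface per class and contours each at 0.5. The colour choices and the variable names xs, ys and grid are assumptions made for this example.

```r
library(class)   # provides knn()

train <- rbind(iris3[1:25, 1:2, 1], iris3[1:25, 1:2, 2], iris3[1:25, 1:2, 3])
cl <- factor(c(rep("s", 25), rep("c", 25), rep("v", 25)))

# Grid axes; expand.grid varies x fastest, matching matrix(..., nrow = length(xs))
xs <- seq(min(train[, 1]) - 1, max(train[, 1]) + 1, by = 0.1)
ys <- seq(min(train[, 2]) - 1, max(train[, 2]) + 1, by = 0.1)
grid <- expand.grid(x = xs, y = ys)

set.seed(1)      # knn breaks ties at random
classif <- knn(train, grid, cl, k = 3)

cols <- c(c = "coral", s = "cornflowerblue", v = "darkgreen")
plot(grid$x, grid$y, pch = ".", col = cols[as.character(classif)],
     xlab = "Sepal length", ylab = "Sepal width")
points(train, pch = 19, col = cols[as.character(cl)])

# One 0/1 indicator surface per class; its 0.5 contour is that class's boundary
for (k in levels(cl)) {
  ind <- matrix(as.numeric(classif == k), nrow = length(xs), ncol = length(ys))
  contour(xs, ys, ind, levels = 0.5, add = TRUE, drawlabels = FALSE, col = cols[k])
}
```

The same loop should work for any classifier that returns a predicted class for every grid point, which fits the stated goal of comparing decision boundaries.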

Adding markers to 3D plot in R

I need to mark where certain observations appear in a 3D plotted joint density function -- I envision adding a vector, (x, y, f(x,y) + something_small) to the density plot, showing where the point is. I have tried using trans3d(), but that hasn't worked.
Here is an example:
library(MASS)
Sigma <- matrix(c(12,1,1,12),2,2)
Sample <- mvrnorm(n=1000, rep(0, 2), Sigma)
empDen <- kde2d(Sample[,1],Sample[,2])
par(bg = "white")
x <- empDen$x
y <- empDen$y
z <- empDen$z
nrz <- nrow(z)
ncz <- ncol(z)
jet.colors <- colorRampPalette( c("lightblue", "blue") )
nbcol <- 100
color <- jet.colors(nbcol)
zfacet <- z[-1, -1] + z[-1, -ncz] + z[-nrz, -1] + z[-nrz, -ncz]
facetcol <- cut(zfacet, nbcol)
persp(x, y, z, col = color[facetcol], phi = 15, theta = -50, xlab="x", ylab="y", zlab="Empirical Joint Density", border=NA)
The question is: How do I indicate where Sample[1,] appears in the joint density, i.e. add this to the plot?
Thanks for any tips!
This works:
fmt=persp(x, y, z, col = color[facetcol], phi = 15, theta = -50, xlab="x", ylab="y", zlab="Empirical Joint Density", border=NA)
pt = Sample[1,]
points(trans3d(pt[1],pt[2],.001,fmt),pch=20, col="Red")
lines(trans3d(c(pt[1],pt[1]), c(pt[2],pt[2]), c(0,.001),fmt),col="Red",lwd=2)
It would be nicer, though, to replace .001 with a height derived from the empirical joint density instead of manually specifying a value for each point.
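One way to get that height, sketched here under the assumption that the nearest kde2d grid cell is accurate enough, is to look up the estimated density at the grid point closest to the observation. density_at is a hypothetical helper written for this example, not a MASS function.

```r
library(MASS)

set.seed(1)
Sigma  <- matrix(c(12, 1, 1, 12), 2, 2)
Sample <- mvrnorm(n = 1000, mu = rep(0, 2), Sigma = Sigma)
empDen <- kde2d(Sample[, 1], Sample[, 2])

# Hypothetical helper: estimated density at the grid cell nearest to pt.
# kde2d stores z with rows indexed by x and columns indexed by y.
density_at <- function(den, pt) {
  ix <- which.min(abs(den$x - pt[1]))
  iy <- which.min(abs(den$y - pt[2]))
  den$z[ix, iy]
}

pt <- Sample[1, ]
h  <- density_at(empDen, pt)

fmt <- persp(empDen$x, empDen$y, empDen$z, phi = 15, theta = -50,
             col = "lightblue", border = NA,
             xlab = "x", ylab = "y", zlab = "Empirical Joint Density")
# Vertical segment from the floor up to the surface, with a dot on the surface
lines(trans3d(rep(pt[1], 2), rep(pt[2], 2), c(0, h), fmt), col = "red", lwd = 2)
points(trans3d(pt[1], pt[2], h, fmt), pch = 20, col = "red")
```

With the default 25 x 25 grid the lookup is coarse; increasing the n argument of kde2d, or interpolating between cells, would place the marker more precisely.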

How to plot a normal distribution by labeling specific parts of the x-axis?

I am using the following code to create a standard normal distribution in R:
x <- seq(-4, 4, length=200)
y <- dnorm(x, mean=0, sd=1)
plot(x, y, type="l", lwd=2)
I need the x-axis to be labeled at the mean and at points three standard deviations above and below the mean. How can I add these labels?
The easiest (but not general) way is to restrict the limits of the x axis. The +/- 1:3 sigma will be labeled as such, and the mean will be labeled as 0 - indicating 0 deviations from the mean.
plot(x,y, type = "l", lwd = 2, xlim = c(-3.5,3.5))
Another option is to use more specific labels:
plot(x,y, type = "l", lwd = 2, axes = FALSE, xlab = "", ylab = "")
axis(1, at = -3:3, labels = c("-3s", "-2s", "-1s", "mean", "1s", "2s", "3s"))
Using the code in this answer, you could skip creating x and just use curve() on the dnorm function:
curve(dnorm, -3.5, 3.5, lwd=2, axes = FALSE, xlab = "", ylab = "")
axis(1, at = -3:3, labels = c("-3s", "-2s", "-1s", "mean", "1s", "2s", "3s"))
But this doesn't use the given code anymore.
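If the distribution is not standard, the same axis() idea can be sketched with tick positions computed from the mean and standard deviation and plotmath expressions as labels. The values mu = 100 and s = 15 are assumptions picked only for illustration.

```r
mu <- 100   # assumed mean, for illustration only
s  <- 15    # assumed standard deviation, for illustration only

# Tick positions at the mean and at 1, 2 and 3 standard deviations either side
ticks <- mu + (-3:3) * s
tick_labels <- expression(mu - 3*sigma, mu - 2*sigma, mu - sigma, mu,
                          mu + sigma, mu + 2*sigma, mu + 3*sigma)

# Suppress the default x axis, then draw a custom one
curve(dnorm(x, mean = mu, sd = s), from = mu - 3.5 * s, to = mu + 3.5 * s,
      lwd = 2, xaxt = "n", xlab = "", ylab = "Density")
axis(1, at = ticks, labels = tick_labels)
```

Passing labels = ticks instead would print the numeric values rather than the Greek symbols.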
If you prefer the hard way, without R's built-in function, or you want to do this outside R, you can use the following formula.
x<-seq(-4,4,length=200)
s = 1
mu = 0
y <- (1/(s * sqrt(2*pi))) * exp(-((x-mu)^2)/(2*s^2))
plot(x,y, type="l", lwd=2, col = "blue", xlim = c(-3.5,3.5))
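As a quick check that the hand-written formula agrees with the built-in density, the two can be compared numerically. The variable names y_manual, y_builtin and same are introduced here for the comparison.

```r
x  <- seq(-4, 4, length = 200)
s  <- 1
mu <- 0

# Density from the explicit formula and from R's built-in dnorm
y_manual  <- (1 / (s * sqrt(2 * pi))) * exp(-((x - mu)^2) / (2 * s^2))
y_builtin <- dnorm(x, mean = mu, sd = s)

# TRUE when the two agree up to floating-point tolerance
same <- isTRUE(all.equal(y_manual, y_builtin))
```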
An extremely inefficient and unusual, but beautiful solution, which works based on the ideas of Monte Carlo simulation, is this:
Simulate many draws (or samples) from a given distribution (say the normal) using rnorm. The rnorm function takes arguments (A, B, C) and returns a vector of A samples from a normal distribution centered at B with standard deviation C.
Plot the density of these draws using density.
Thus to take a sample of size 50,000 from a standard normal (i.e, a normal with mean 0 and standard deviation 1), and plot its density, we do the following:
x = rnorm(50000,0,1)
plot(density(x))
As the number of draws goes to infinity, this will converge in distribution to the normal. The original illustration showed, from left to right and top to bottom, the result for 5,000, 50,000, 500,000, and 5 million samples.
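A rough sketch of how such a four-panel comparison could be produced follows. The seed, the red reference curve and the max_err bookkeeping are additions for this example, and the largest sample may take a few seconds.

```r
set.seed(42)
sizes   <- c(5000, 50000, 500000, 5000000)
max_err <- numeric(length(sizes))

op <- par(mfrow = c(2, 2))   # 2 x 2 grid of panels, filled row by row
for (i in seq_along(sizes)) {
  d <- density(rnorm(sizes[i], mean = 0, sd = 1))
  plot(d, main = paste(format(sizes[i], big.mark = ",", scientific = FALSE), "draws"))
  # Exact standard normal density for reference
  curve(dnorm(x), add = TRUE, col = "red", lty = 2)
  # Largest gap between the estimated and the exact density on the plotted grid
  max_err[i] <- max(abs(d$y - dnorm(d$x)))
}
par(op)
```

Inspecting max_err gives a numeric sense of how closely each estimate tracks the true curve.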
In the general case, for example Normal(2, 1):
f <- function(x) dnorm(x, 2, 1)
plot(f, -1, 5)
This is very general: f can be defined freely, with any given parameters, for example:
f <- function(x) dbeta(x, 0.1, 0.1)
plot(f, 0, 1)
I particularly like lattice for this. It makes it easy to add graphical information such as a specific area under a curve, which you usually need for probability problems such as finding P(a < X < b).
Please have a look:
library(lattice)
e4a <- seq(-4, 4, length = 10000) # Data to set up our normal
e4b <- dnorm(e4a, 0, 1)
xyplot(e4b ~ e4a, # Lattice xyplot
type = "l",
main = "Plot 2",
panel = function(x,y, ...){
panel.xyplot(x,y, ...)
panel.abline( v = c(0, 1, 1.5), lty = 2) #set z and lines
xx <- c(1, x[x>=1 & x<=1.5], 1.5) #Color area
yy <- c(0, y[x>=1 & x<=1.5], 0)
panel.polygon(xx,yy, ..., col='red')
})
In this example I make the area between z = 1 and z = 1.5 stand out. You can easily change these parameters to suit your problem.
Axis labels are automatic.
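The highlighted area corresponds to P(1 < Z < 1.5), which can be computed directly with pnorm. The integrate call below is only a cross-check, and the names a, b, p_between and p_numeric are introduced for this sketch.

```r
a <- 1
b <- 1.5

# P(a < Z < b) for a standard normal, from the cumulative distribution function
p_between <- pnorm(b) - pnorm(a)

# Cross-check by numerically integrating the density over the same interval
p_numeric <- integrate(dnorm, lower = a, upper = b)$value
```

Here p_between comes out at roughly 0.092, so the red region covers a little over 9% of the total area.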
This is how to wrap it in a function:
normalCriticalTest <- function(mu, s) {
  x <- seq(-4, 4, length=200) # x extends from -4 to 4
  # y follows the formula of the normal density f(x)
  y <- (1/(s * sqrt(2*pi))) * exp(-((x-mu)^2)/(2*s^2))
  plot(x,y, type="l", lwd=2, xlim = c(-3.5,3.5)) # draw the curve
  # vertical lines at +/-1.96, leaving 2.5% of the area in each tail of the standard normal
  abline(v = c(-1.96, 1.96), col="red")
}
normalCriticalTest(0, 1) # draw a normal distribution with vertical lines.
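The function above hard-codes 1.96 and the x range, so it only fits the standard normal. A possible generalisation, sketched here with the hypothetical name normalCriticalPlot and an assumed alpha argument, derives the critical values from qnorm and scales the plotting range with mu and s.

```r
normalCriticalPlot <- function(mu = 0, s = 1, alpha = 0.05) {
  # Two-sided critical values leaving alpha/2 of the area in each tail
  crit <- mu + qnorm(c(alpha / 2, 1 - alpha / 2)) * s
  curve(dnorm(x, mean = mu, sd = s), from = mu - 4 * s, to = mu + 4 * s,
        lwd = 2, xlab = "x", ylab = "Density")
  abline(v = crit, col = "red")
  invisible(crit)   # return the cut-offs without printing them
}

crit <- normalCriticalPlot(0, 1)
```

For the standard normal with alpha = 0.05 this reproduces the familiar cut-offs of about -1.96 and 1.96.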