Problem with theoretical density functions plotted to histograms in R

I ran simulations in R and plotted the results as histograms. Adding an estimated probability density to the histograms with the density() function works without problems. But for some reason I get very strange results when I plot the theoretical densities on top of the histograms for comparison. Here are two code examples. The blue theoretical pdfs are way off and I do not know why. Could someone with R skills point out my mistakes?
# generating 10000 samples thrice from U(0,1) distribution
# and sorting them for the statistics:
n <- 10000
samples1 <- data.frame('0'=c(rep(NA,4)))
samples2 <- data.frame('0'=c(rep(NA,10)))
samples3 <- data.frame('0'=c(rep(NA,10)))
for (i in 1:n) {
new <- runif(4)
samples1[ , ncol(samples1) + 1] <- sort(new)
colnames(samples1)[ncol(samples1)] <- i
new <- runif(10)
samples2[ , ncol(samples2)+1] <- sort(new)
colnames(samples2)[ncol(samples2)] <- i
new <- runif(10)
samples3[ , ncol(samples3)+1] <- sort(new)
colnames(samples3)[ncol(samples3)] <- i
}
# dropping the first (useless) columns:
samples1 <- samples1[-c(1)]
samples2 <- samples2[-c(1)]
samples3 <- samples3[-c(1)]
# selecting the statistics from the samples:
# X_2:4
stat24 <- rep(NA,n)
for (i in 1:n) {
stat24[i] <- samples1[2,i]
}
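# X_2:10 (taken from samples2, the first set of samples of size 10)
stat210 <- rep(NA,n)
for (i in 1:n) {
stat210[i] <- samples2[2,i]
}
# X_10:10 (taken from samples3; samples1 has only 4 rows)
stat1010 <- rep(NA,n)
for (i in 1:n) {
stat1010[i] <- samples3[10,i]
}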
# plotting the histograms and Beta pdfs:
hist(stat24, freq = FALSE)
lines(dbeta(stat24, 2, 3), col='blue')
lines(density(stat24), col='red')
The kth order statistic of n uniform draws follows a Beta(k, n-k+1) distribution, which appears as the odd blue stroke on the right.
n <- 10000
random_variable_F <- rep(NA,n)
# generating 10000 samples of sizes 10 and 5 and computing F:
for (i in 1:n) {
x <- rnorm(10, mean = 10, sd = sqrt(5))
y <- rnorm(5, mean = 20, sd = sqrt(10))
# each sample variance is divided by its population variance (5 and 10)
random_variable_F[i] <- (var(x)/5)/(var(y)/10)
}
#head(random_variable_F)
# plotting the histogram:
hist(random_variable_F, freq = FALSE)
lines(density(random_variable_F), col='red')
lines(df(random_variable_F, 9, 4), col='blue')
The random variable F follows an F-distribution. The parameters are the sample sizes minus one, in this case 10-1=9 and 5-1=4. The theoretical curve is quite wild.

If you pass a single vector to lines, it assumes that this is a vector of y values you want to plot. It plots the first y value at x = 1, the second y value at x = 2, etc, all the way up to x = length(y). In your case, random_variable_F is an unordered random variable, and you are just plotting its sequential values at 1:10000 along the x axis.
Clearly, you want the function y = df(x) to be plotted, so you need to pass random_variable_F as the x values and df(random_variable_F) as the y values. You will also need to sort random_variable_F first to ensure the line is plotted from left to right:
hist(random_variable_F, freq = F)
lines(density(random_variable_F), col='red')
lines(sort(random_variable_F), df(sort(random_variable_F), 9, 4), col='blue')
Note that this doesn't happen when you plot lines(density(random_variable_F)) because density produces a list containing ordered x and y values rather than a single vector.
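The same fix applies to the order-statistic example from the question. Here is a minimal sketch, assuming the standard result that the kth order statistic of n independent U(0,1) draws follows Beta(k, n-k+1), so X_2:4 is compared with Beta(2, 3). The names x24 and grid are made up for this illustration, and the pdf is evaluated on a regular grid rather than on the sorted sample:

```r
set.seed(1)
n_sim <- 10000

# second smallest of 4 uniform draws, repeated n_sim times
x24 <- replicate(n_sim, sort(runif(4))[2])

hist(x24, freq = FALSE)
lines(density(x24), col = "red")

# evaluate the theoretical pdf on an ordered grid and pass both x and y
grid <- seq(0, 1, length.out = 200)
lines(grid, dbeta(grid, 2, 3), col = "blue")
```

Because grid is already increasing, no sorting is needed and the blue curve is drawn from left to right.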

Related

How to modify the code below to present the bias of three estimators?

I have a Uniform distribution with parameters (0,$\theta$),
and I have calculated its MME and two other estimators.
I also calculated their individual biases in my code.
But I am having trouble plotting the three biases properly in one plot.
Could you suggest how to modify the "matplot" part? Maybe the problem lies there? Thank you very much!
n <- 1000
theta <- 9
x <- runif(n, 0, theta)
theta.hat.mme <-2*mean(x)
theta.hat.G <-(n*theta)/(n+1)
theta.hat.H <-((n+1)/n)*theta.hat.G
# theta.hat.G is the max of (x_1, x_2,...,x_n) be an estimator of theta, and theta.hat.H is the modified estimator of theta.hat.G to make it unbiased.
# Empirical Exploration of properties
M <- 1000 # Number of sample of size n
L <- 9 # Different values of n
# create storage variables
mme <- numeric(M)
vn <- numeric(L)
bias <- variance <- matrix(0, nrow=L, ncol=3)
for(l in 1:L)
{
n <- 20 + (l-1)*10
for(m in 1:M)
{
x <- runif(n, 0, theta)
mme[m] <- 2*mean(x)
}
bias[l,1] <- mean(mme)-theta
bias[1,2] <-mean(theta.hat.G)-theta
bias[1,3] <-mean(theta.hat.H)-theta
variance[l,1] <- var(mme)
variance[1,2] <-var(theta.hat.G)
variance[1,3] <-var(theta.hat.H)
vn[l] <- n
}
#windows(5,5)
matplot(vn, bias, type='l', col=1:3, lty=1:3, xlab='n')
abline(h=0, lty=3)
legend('topright', col=1:3,
c('Method of Moments Estimator', 'Theta G', 'Theta H'),
inset=0.01, lty=1:2)
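A possible rework, offered as a sketch rather than a definitive answer. It assumes the intent is to recompute all three estimators from every simulated sample (G as the sample maximum and H as (n+1)/n times G, as described in the comment above) and to fill row l of bias for each sample size. The est matrix is a name introduced here for illustration:

```r
set.seed(42)
theta <- 9
M <- 1000                    # number of samples per sample size
vn <- 20 + (0:8) * 10        # sample sizes 20, 30, ..., 100
L <- length(vn)
bias <- matrix(0, nrow = L, ncol = 3)

for (l in seq_len(L)) {
  n <- vn[l]
  est <- matrix(0, nrow = M, ncol = 3)   # one row per simulated sample
  for (m in seq_len(M)) {
    x <- runif(n, 0, theta)
    est[m, 1] <- 2 * mean(x)             # method of moments estimator
    est[m, 2] <- max(x)                  # theta.hat.G
    est[m, 3] <- (n + 1) / n * max(x)    # theta.hat.H
  }
  bias[l, ] <- colMeans(est) - theta     # row l, not row 1
}

matplot(vn, bias, type = "l", col = 1:3, lty = 1:3, xlab = "n", ylab = "bias")
abline(h = 0, lty = 3)
legend("bottomright", col = 1:3, lty = 1:3, inset = 0.01,
       legend = c("Method of Moments Estimator", "Theta G", "Theta H"))
```

With this layout every column of bias varies with n, so matplot has three genuine curves to draw, and the legend uses lty=1:3 to match them.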

How to generate a probability density function and expectation in r?

The task:
Eric the fly has a friend, Ernie. Assume that the two flies sit at independent locations, uniformly distributed on the globe’s surface. Let D denote the Euclidean distance between Eric and Ernie (i.e., on a straight line through the interior of the globe).
Make a conjecture about the probability density function of D and give an
estimate of its expected value, E(D).
So far I have made a function to generate two points on the globe's surface, but I am unsure what to do next:
sample3d <- function(n)
{
df <- data.frame()
while(n > 0){
x <- runif(1,-1,1)
y <- runif(1,-1,1)
z <- runif(1,-1,1)
r <- x^2 + y^2 + z^2
if (r < 1){
u <- sqrt(x^2+y^2+z^2)
vector = data.frame(x = x/u,y = y/u, z = z/u)
df <- rbind(vector,df)
n = n- 1
}
}
df
}
E <- sample3d(2)
This is an interesting problem. I'll outline a computational approach; I'll leave the math up to you.
First we fix a random seed for reproducibility.
set.seed(2018);
We sample 10^4 points from the unit sphere surface.
sample3d <- function(n = 100) {
df <- data.frame();
while(n > 0) {
x <- runif(1,-1,1)
y <- runif(1,-1,1)
z <- runif(1,-1,1)
r <- x^2 + y^2 + z^2
if (r < 1) {
u <- sqrt(x^2 + y^2 + z^2)
vector = data.frame(x = x/u,y = y/u, z = z/u)
df <- rbind(vector,df)
n = n- 1
}
}
df
}
df <- sample3d(10^4);
Note that sample3d is not very efficient, but that's a different issue.
We now randomly sample 2 points from df, calculate the Euclidean distance between those two points (using dist), and repeat this procedure N = 10^4 times.
# Sample 2 points randomly from df, repeat N times
N <- 10^4;
dist <- replicate(N, dist(df[sample(1:nrow(df), 2), ]));
As pointed out by @JosephWood, the number N = 10^4 is somewhat arbitrary. We are using a bootstrap to derive the empirical distribution. For N -> infinity one can show that the empirical bootstrap distribution is the same as the (unknown) population distribution (Bootstrap theorem). The error term between empirical and population distribution is of the order 1/sqrt(N), so N = 10^4 should lead to an error around 1%.
We can plot the resulting probability distribution as a histogram:
# Let's plot the distribution
library(ggplot2)
ggplot(data.frame(x = dist), aes(x)) + geom_histogram(bins = 50);
Finally, we can get empirical estimates for the mean and median.
# Mean
mean(dist);
#[1] 1.333021
# Median
median(dist);
#[1] 1.41602
These values are close to the theoretical values:
mean.th = 4/3
median.th = sqrt(2)
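As a cross-check, here is a sketch that draws the two points directly instead of resampling pairs from a fixed pool. It assumes the standard trick of normalising 3-D Gaussian vectors to obtain uniform points on the unit sphere, and it overlays the conjectured density f(d) = d/2 on [0, 2], which is consistent with the mean 4/3 and median sqrt(2) quoted above. runif_sphere is a helper name invented for this example:

```r
set.seed(2018)

# n points uniformly distributed on the unit sphere, one point per row
runif_sphere <- function(n) {
  p <- matrix(rnorm(3 * n), ncol = 3)
  p / sqrt(rowSums(p^2))   # divide each row by its Euclidean length
}

N <- 10^4
eric  <- runif_sphere(N)
ernie <- runif_sphere(N)

# straight-line (chord) distance between the two flies
D <- sqrt(rowSums((eric - ernie)^2))

hist(D, freq = FALSE, breaks = 50)
curve(x / 2, from = 0, to = 2, add = TRUE, col = "blue")

mean(D)
median(D)
```

Each pair of points is independent here, so the N distances do not share points the way pairs drawn from a common pool of 10^4 points do.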

How to plot nicely-spaced data labels?

Labeling data points in a plot can get unwieldy:
Randomly sampling few labels may disappoint:
What would be a nice way to pick a small set of nicely-spaced data labels? That is, to randomly pick representatives whose labels are not overlapping.
# demo data
set.seed(123)
N <- 50
x <- runif(N)
y <- x + rnorm(N, 0, x)
data <- data.frame(x, y, labels=state.name)
# plot with labels
plot(x,y)
text(x,y,data$labels)
# plot a few labels
frame()
few_labels <- data[sample(N, 10), ]
plot(x,y)
with(few_labels, text(x,y,labels))
One way to do this is through clustering. Here is a solution with stats::hclust. We group the data points into clusters and then pick one random observation from each cluster.
few_labels <- function(df, coord=1:ncol(df),grp=5){
require(dplyr)
df$cl <- cutree(hclust(dist(df[,coord])),grp)
few_labels <- df %>% group_by(cl) %>%
do(sample_n(.,1))
return(few_labels)
}
# demo data
set.seed(123)
N <- 50
x <- runif(N)
y <- x + rnorm(N, 0, x)
data <- data.frame(x, y, labels=state.name)
# plot a few labels
frame()
few_labels <- few_labels(data,coord=1:2,grp=12)
plot(x,y)
with(few_labels, text(x,y,labels))
For all labels:
xlims=c(-1,2)
plot(x,y,xlim=xlims)
#text(x,y,data$labels,pos = 2,cex=0.7)
library(plotrix)
spread.labels(x,y,data$labels,cex=0.7,ony=NA)
Another way is to pick a point at random, discard all points close to it, and repeat until no point is left:
radius <- .1 # of a ball containing the largest label
d <- as.matrix(dist(data[, c("x","y")], upper=TRUE, diag=TRUE))
remaining <- 1:N
spaced <- numeric()
i <- 1
while(length(remaining)>0) {
p <- ifelse(length(remaining)>1, sample(remaining, 1), remaining)
spaced <- c(spaced, p) # ...
remaining <- setdiff(remaining, which(d[p, ] < 2*radius))
i <- i + 1
}
frame()
plot(x,y)
spaced_labels <- data[spaced, ]
with(spaced_labels, text(x,y,labels))

R plot heatmap of matrix with superposed line

I have a matrix whose values I want to plot. Using the image function it looks like this.
How can I plot a line over the image?
(In my case, I want to plot a line over the maximum values along the x axis)
Edit
The image and the line I want to plot over are the output from the Bayesian Online Changepoint detection. Since it is quite short I'll share the whole code. The plotting part is at the end:
# Bayesian Online Changepoint Detection
# Adams, MacKay 2007
# http://hips.seas.harvard.edu/content/bayesian-online-changepoint-detection
#######################################
# Other python and matlab implementations
# https://github.com/JackKelly/bayesianchangepoint
# http://hips.seas.harvard.edu/content/bayesian-online-changepoint-detection
# http://www.inference.phy.cam.ac.uk/rpa23/cp/gaussdemo.m
# http://www.inference.phy.cam.ac.uk/rpa23/cp/studentpdf.m
# http://www.inference.phy.cam.ac.uk/rpa23/cp/constant_hazard.m
# Even more commented, but different paper:
# https://github.com/davyfeng/ipdata/blob/master/csv/bocpd/core/bocpd.m
# Generate data
x1 <- rnorm(100, 10.5, 0.1)
x2 <- rnorm(100, 1, 0.1)
x3 <- rnorm(100, -10, 0.1)
x4 <- rnorm(100, -1, 0.1)
x5 <- rnorm(100, 5, 0.1)
x6 <- rnorm(30, 1, 0.1)
x7 <- rnorm(100, 8, 0.1)
x <- c(x1,x2,x3,x4,x5, x6,x7)
##############
# Algorithm
##############
# Prepare the scaled and shifted student-t
dt.scaled.shifted <- function(x, m, s, df) stats::dt((x-m)/s, df)/s
# Prepare the Hazard function
hazard <-function(x, lambda=200){rep(1/lambda, length(x))}
L <- length(x)
R <- matrix(rep(0,(L+1)*(L+1)), L+1, L+1)
R[1,1] <- 1 # for t=1 we are sure that p(r=1)=1
mu0 <- 0; kappa0 <- 1; alpha0 <-1; beta0 <- 1;
muT <- mu0
kappaT <- kappa0
alphaT <- alpha0
betaT <- beta0
maxes <- rep(0, L)
# Process data as they come in
for(t in 1:L){
# Evaluate predictive probability
predprobs <- dt.scaled.shifted(x[t], muT, betaT*(kappaT+1)/(alphaT*-kappaT), 2*alphaT)
H <- hazard(x[1:t])
# Calculate growth probabilities
R[2:(t+1), t+1] <- R[1:t,t]*predprobs*(1-H)
# Calculate changepoint (reset) probabilities
R[1,t+1] <- sum(R[1:t,t]*predprobs*H)
# Renormalize
R[,t+1] <- R[,t+1] / sum(R[,t+1])
# Update parameters for each possible run length
# keep the past ones since they will be used iteratively
muT0 <- c(mu0, (kappaT*muT + x[t])/(kappaT+1))
kappaT0 <- c(kappa0,kappaT+1)
alphaT0 <- c(alpha0, alphaT + 0.5)
betaT0 <- c(beta0, kappaT + (kappaT * (x[t]-muT)^2)/(2*(kappaT+1)))
muT <- muT0
kappaT <- kappaT0
alphaT <- alphaT0
betaT <- betaT0
# Store the maximum, to plot later
maxes[t] <- which.max(R[,t])
}
# Plot results
par(mfrow=c(2,1))
plot(x, type='l')
image((-t(log(R))), col = grey(seq(0,1,length=256)), axes=T)
par(new=T)
plot(1:(dim(R)[1]-1), maxes,type='l', col="red")
On the top there is the original data. On the bottom, the probability of a current run to have length y. The red line on the bottom should fit the dark shades.
(To be deleted. It does not work. I am leaving it here temporarily to preserve the comments.)
I got it, I thought I had already tried par(new=T) but obviously I hadn't:
m <- matrix(rnorm(100,1,1),50,50)
image(m, col = grey(seq(0,1,length=256)))
par(new=T)
plot(seq(0,1, length=50), type='l', col="red", lwd=5)
Quick example simulating the whole process:
data <- vector()
for(i in 1:50){
data <- rbind(data, dpois(1:50, i^1.2))
}
maxes <- apply(data, 1, which.max)
image(-data, col = grey(seq(0,1,length=256)))
par(new=T)
plot(1:dim(data)[1], c(maxes),type='l')
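An alternative sketch that avoids par(new=T) altogether: when image() is given explicit x and y coordinates, a later lines() call draws in the same coordinate system, so the line and the heatmap cannot drift apart. This reuses the Poisson toy data from the quick example; the idx name and the axis labels are illustrative assumptions:

```r
idx <- 1:50

# row i holds the Poisson pmf with rate i^1.2, evaluated at 1..50
data <- t(sapply(idx, function(i) dpois(idx, i^1.2)))

# column index of the maximum in each row
maxes <- apply(data, 1, which.max)

# rows of the matrix run along the x axis, columns along the y axis
image(x = idx, y = idx, z = -data,
      col = grey(seq(0, 1, length.out = 256)),
      xlab = "row", ylab = "column of maximum")
lines(idx, maxes, col = "red", lwd = 2)
```

The same idea should carry over to the changepoint matrix by passing the time index as x and the run length as y, though that has not been tried on the original data.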

Adding two random variables via convolution in R

I would like to compute the convolution of two probability distributions in R and I need some help. For the sake of simplicity, let's say I have a variable x that is normally distributed with mean = 1.0 and stdev = 0.5, and y that is log-normally distributed with mean = 1.5 and stdev = 0.75. I want to determine z = x + y. I understand that the distribution of z is not known a priori.
As an aside the real world example I am working with requires addition to two random variables that are distributed according to a number of different distributions.
Does anyone know how to add two random variables by convolving the probability density functions of x and y?
I have tried generating n normally distributed random values (with above parameters) and adding them to n log-normally distributed random values. However, I wish to know if I can use the convolution method instead. Any help would be greatly appreciated.
EDIT
Thank you for these answers. I define a pdf, and try to do the convolution integral, but R complains on the integration step. My pdfs are Log Pearson 3 and are as follows
dlp3 <- function(x, a, b, g) {
p1 <- 1/(x*abs(b) * gamma(a))
p2 <- ((log(x)-g)/b)^(a-1)
p3 <- exp(-1* (log(x)-g) / b)
d <- p1 * p2 * p3
return(d)
}
f.m <- function(x) dlp3(x,3.2594,-0.18218,0.53441)
f.s <- function(x) dlp3(x,9.5645,-0.07676,1.184)
f.t <- function(z) integrate(function(x,z) f.s(z-x)*f.m(x),-Inf,Inf,z)$value
f.t <- Vectorize(f.t)
integrate(f.t, lower = 0, upper = 3.6)
R complains at the last step since the f.t function is bounded and my integration limits are probably not correct. Any ideas on how to solve this?
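One possible cause, sketched under assumptions: with a negative scale b, the term ((log(x)-g)/b)^(a-1) is only defined for 0 < x < exp(g), so integrating over (-Inf, Inf) feeds NaN values to integrate. The sketch below returns 0 outside that support and restricts the inner integral to the region where both factors can be non-zero. dlp3_safe, g.m and g.s are hypothetical names, and the result has not been checked against the original data:

```r
# Log Pearson 3 density that returns 0 outside its support
dlp3_safe <- function(x, a, b, g) {
  out <- numeric(length(x))
  u <- rep(NA_real_, length(x))
  pos <- x > 0
  u[pos] <- (log(x[pos]) - g) / b
  ok <- pos & u > 0                      # inside the support only
  out[ok] <- u[ok]^(a - 1) * exp(-u[ok]) / (x[ok] * abs(b) * gamma(a))
  out
}

g.m <- 0.53441
g.s <- 1.184
f.m <- function(x) dlp3_safe(x, 3.2594, -0.18218, g.m)
f.s <- function(x) dlp3_safe(x, 9.5645, -0.07676, g.s)

# convolution: x must lie in (0, exp(g.m)) and z - x in (0, exp(g.s))
f.t <- function(z) {
  lo <- max(0, z - exp(g.s))
  hi <- min(z, exp(g.m))
  if (hi <= lo) return(0)
  integrate(function(x) f.s(z - x) * f.m(x), lo, hi)$value
}
f.t <- Vectorize(f.t)

integrate(f.t, lower = 0, upper = 3.6)
```

Since both b values are negative, the sum has support (0, exp(g.m) + exp(g.s)), so an upper limit of 3.6 covers only part of it.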
Here is one way.
f.X <- function(x) dnorm(x,1,0.5) # normal (mu=1, sigma=0.5)
f.Y <- function(y) dlnorm(y,1.5, 0.75) # log-normal (mu=1.5, sigma=0.75)
# convolution integral
f.Z <- function(z) integrate(function(x,z) f.Y(z-x)*f.X(x),-Inf,Inf,z)$value
f.Z <- Vectorize(f.Z) # need to vectorize the resulting fn.
set.seed(1) # for reproducible example
X <- rnorm(1000,1,0.5)
Y <- rlnorm(1000,1.5,0.75)
Z <- X + Y
# compare the methods
hist(Z,freq=F,breaks=50, xlim=c(0,30))
z <- seq(0,50,0.01)
lines(z,f.Z(z),lty=2,col="red")
Same thing using package distr.
library(distr)
N <- Norm(mean=1, sd=0.5) # N is signature for normal dist
L <- Lnorm(meanlog=1.5,sdlog=0.75) # same for log-normal
conv <- convpow(L+N,1) # object of class AbscontDistribution
f.Z <- d(conv) # density function
hist(Z,freq=F,breaks=50, xlim=c(0,30))
z <- seq(0,50,0.01)
lines(z,f.Z(z),lty=2,col="red")
I was having trouble getting integrate() to work for different density parameters, so I came up with an alternative to @jlhoward's using a Riemann approximation:
set.seed(1)
#densities to be convolved. could also put these in the function below
d1 <- function(x) dnorm(x,1,0.5) #
d2 <- function(y) dlnorm(y,1.5, 0.75)
#Riemann approximation of convolution
conv <- function(t, a, b, d) { #a to b needs to cover the range of densities above. d needs to be small for accurate approx.
z <- numeric(length(t)) # preallocate the result
x <- seq(a, b, d)
for (i in seq_along(t)){
z[i] <- sum(d1(x)*d2(t[i]-x)*d)
}
return(z)
}
#check against sampled convolution
X <- rnorm(1000, 1, 0.5)
Y <- rlnorm(1000, 1.5, 0.75)
Z <- X + Y
t <- seq(0, 50, 0.05) #range to evaluate t, smaller increment -> smoother curve
hist(Z, breaks = 50, freq = F, xlim = c(0,30))
lines(t, conv(t, -100, 100, 0.1), type = "s", col = "red")
