Find the maximum likelihood estimate of p in a binomial distribution characterized by 9
successes in 20 trials. Show graphically that this is the maximum. Include your R code with
your answers.
This is a prompt I've been given for a homework assignment but the teacher never explained how to do it. I understand the basic concept, but I'm not sure how to find the estimate of p in R, or how to graph it. Any advice/help would be greatly appreciated!
The exercise is not that difficult.
Write down two functions, the textbook likelihood and log-likelihood;
Use optimize to maximize them;
Below, the tol argument is smaller than the default. Both functions are continuous on (0, 1) and have a single interior maximum (the log-likelihood is concave), so the maximum exists and is unique, and a smaller tol locates it more precisely;
Finally, plot the functions and the maxima found earlier.
ll <- function(x, n, k) choose(n, k) * x^k * (1 - x)^(n - k)
log_ll <- function(x, n, k) lchoose(n, k) + k*log(x) + (n - k)*log(1 - x)
xmax <- optimize(ll, c(0, 1), n = 20, k = 9, maximum = TRUE, tol = .Machine$double.eps^0.5)
xmax$maximum
#> [1] 0.45
xmax_log <- optimize(log_ll, c(0, 1), n = 20, k = 9, maximum = TRUE, tol = .Machine$double.eps^0.5)
xmax_log$maximum
#> [1] 0.45
# save the default graphics parameters
old_par <- par(mfrow = c(2, 1), mai = c(0.8, 1, 0.1, 1))
#
curve(ll(x, n = 20, k = 9), from = 0, to = 1, xlab = "")
segments(x0 = xmax$maximum, y0 = -1, y1 = xmax$objective, lty = "dashed")
segments(x0 = -1, x1 = xmax$maximum, y0 = xmax$objective, lty = "dashed")
points(xmax$maximum, xmax$objective, col = "red", pch = 16)
#
curve(log_ll(x, n = 20, k = 9), from = 0, to = 1, xlab = "Binomial proportion")
segments(x0 = xmax_log$maximum, y0 = -50, y1 = xmax_log$objective, lty = "dashed")
segments(x0 = -1, x1 = xmax_log$maximum, y0 = xmax_log$objective, lty = "dashed")
points(xmax_log$maximum, xmax_log$objective, col = "red", pch = 16)
#
# restore the default graphics parameters
par(old_par)
Created on 2022-10-11 with reprex v2.0.2
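The numerical result can also be checked against the closed form: setting the derivative of the log-likelihood, k/p - (n - k)/(1 - p), to zero gives p = k/n = 9/20 = 0.45. A minimal sketch of that check follows; it uses dbinom for the log-likelihood instead of the hand-written function, and the names p_hat and fit are only illustrative.

```r
n <- 20
k <- 9

# closed-form MLE of a binomial proportion
p_hat <- k / n

# numerical maximisation of the binomial log-likelihood for comparison
log_lik <- function(p) dbinom(k, size = n, prob = p, log = TRUE)
fit <- optimize(log_lik, interval = c(0, 1), maximum = TRUE,
                tol = .Machine$double.eps^0.5)

# difference between the numerical and the analytical maximiser
abs(fit$maximum - p_hat)
```

If the two agree to within the tolerance, the dashed lines in the plots above are sitting at k/n.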
I have trouble understanding how to set the levels in the plot of a bivariate distribution in R. The documentation states that I can choose the levels by setting a
numeric vector of levels at which to draw contour lines
Now I would like the contour to show the limit containing 95% of the density or mass. But if, in the example below (adapted from here) I set the vector as a <- c(.95,.90) the code runs without error but the plot is not displayed. If instead, I set the vector as a <- c(.01,.05) the plot is displayed. But I am not sure I understand what the labels "0.01" and "0.05" mean with respect to the density.
library(mnormt)
x <- seq(-5, 5, 0.25)
y <- seq(-5, 5, 0.25)
mu1 <- c(0, 0)
sigma1 <- matrix(c(2, -1, -1, 2), nrow = 2)
f <- function(x, y) dmnorm(cbind(x, y), mu1, sigma1)
z <- outer(x, y, f)
a <- c(.01,.05)
contour(x, y, z, levels = a)
But I am not sure I understand what the labels "0.01" and "0.05" mean with respect to the density.
They mark the points where the density is equal to 0.01 and 0.05. From help("contour"):
numeric vector of levels at which to draw contour lines.
So levels holds the function values at which to draw the contour lines, i.e. the curves along which the function (in this case the density) equals those values. A simple example that may help is x + y:
y <- x <- seq(0, 1, length.out = 50)
z <- outer(x, y, `+`)
par(mar = c(5, 5, 1, 1))
contour(x, y, z, levels = c(0.5, 1, 1.5))
Now I would like the contour to show the limit containing 95% of the density or mass.
In your example, you can follow my answer here and draw the exact points:
# input
mu1 <- c(0, 0)
sigma1 <- matrix(c(2, -1, -1, 2), nrow = 2)
# we start from points on the unit circle
n_points <- 100
xy <- cbind(sin(seq(0, 2 * pi, length.out = n_points)),
cos(seq(0, 2 * pi, length.out = n_points)))
# then we scale the dimensions
ev <- eigen(sigma1)
xy[, 1] <- xy[, 1] * 1
xy[, 2] <- xy[, 2] * sqrt(min(ev$values) / max(ev$values))
# then rotate
phi <- atan(ev$vectors[2, 1] / ev$vectors[1, 1])
R <- matrix(c(cos(phi), sin(phi), -sin(phi), cos(phi)), 2)
xy <- tcrossprod(R, xy)
# find the right length. You can change .95 to whichever
# quantile you want
chi_vals <- qchisq(.95, df = 2) * max(ev$values)
s <- sqrt(chi_vals)
par(mar = c(5, 5, 1, 1))
plot(s * xy[1, ] + mu1[1], s * xy[2, ] + mu1[2], lty = 1,
type = "l", xlab = "x", ylab = "y")
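As a sanity check on this kind of construction, every point on the 95% ellipse should have squared Mahalanobis distance qchisq(.95, df = 2) from the mean. The sketch below builds the same ellipse with a Cholesky factor instead of the eigen-decomposition used above (a swapped-in shortcut, not the method of this answer) and verifies that property with stats::mahalanobis. The names theta, circle, r, ell and d2 are illustrative.

```r
mu1 <- c(0, 0)
sigma1 <- matrix(c(2, -1, -1, 2), nrow = 2)

# points on the unit circle, stored as a 2 x 100 matrix
theta <- seq(0, 2 * pi, length.out = 100)
circle <- rbind(cos(theta), sin(theta))

# radius of the 95% region in standardised coordinates
r <- sqrt(qchisq(0.95, df = 2))

# map the circle through the lower Cholesky factor of sigma1 and shift by the mean
ell <- t(mu1 + r * (t(chol(sigma1)) %*% circle))  # 100 x 2 matrix of (x, y)

# squared Mahalanobis distance of each ellipse point from the mean
d2 <- mahalanobis(ell, center = mu1, cov = sigma1)
range(d2)

plot(ell, type = "l", xlab = "x", ylab = "y")
```

Both values returned by range(d2) should equal qchisq(0.95, df = 2) up to rounding.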
The levels indicate where the lines are drawn, in terms of the 'z' values of the bivariate normal density. Since max(z) is 0.09188815, the levels a <- c(.95,.90) lie above every value of the density and can't be drawn.
To draw the line delimiting 95% of the mass I used the ellipse() function as suggested in this post (second answer from the top).
library(mixtools)
library(mnormt)
x <- seq(-5, 5, 0.25)
y <- seq(-5, 5, 0.25)
mu1 <- c(0, 0)
sigma1 <- matrix(c(2, -1, -1, 2), nrow = 2)
f <- function(x, y) dmnorm(cbind(x, y), mu1, sigma1)
z <- outer(x, y, f)
a <- c(.01,.05)
contour(x, y, z, levels = a)
ellipse(mu=mu1, sigma=sigma1, alpha = .05, npoints = 250, col="red")
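It is also possible to stay with contour() by converting the probability into a density level. For a bivariate normal, the region holding 95% of the mass is bounded by the contour where the squared Mahalanobis distance equals qchisq(.95, df = 2), so the density there is the peak density times exp(-qchisq(.95, 2) / 2), which works out to 0.05 times the peak. A minimal sketch, with the density written out in base R as a stand-in for mnormt::dmnorm; dens, peak and lev95 are illustrative names.

```r
mu1 <- c(0, 0)
sigma1 <- matrix(c(2, -1, -1, 2), nrow = 2)

# bivariate normal density in base R (stand-in for mnormt::dmnorm)
dens <- function(x, y) {
  d <- cbind(x - mu1[1], y - mu1[2])
  q <- rowSums((d %*% solve(sigma1)) * d)  # squared Mahalanobis distance
  exp(-q / 2) / (2 * pi * sqrt(det(sigma1)))
}

x <- seq(-5, 5, 0.25)
y <- seq(-5, 5, 0.25)
z <- outer(x, y, dens)

# density at the mean, and the density level whose contour encloses 95% of the mass
peak <- 1 / (2 * pi * sqrt(det(sigma1)))
lev95 <- peak * exp(-qchisq(0.95, df = 2) / 2)

contour(x, y, z, levels = lev95, labels = "95%")
```

The same formula with qchisq(0.90, df = 2) gives the level for the 90% region. This shortcut is specific to the normal density; other distributions need their own mapping from mass to level.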
I also found another solution in the book "Applied Multivariate Statistics with R" by Daniel Zelterman.
# Figure 6.5: Bivariate confidence ellipse
library(datasets)
library(MASS)
library(MVA)
#> Loading required package: HSAUR2
#> Loading required package: tools
biv <- swiss[, 2 : 3] # Extract bivariate data
bivCI <- function(s, xbar, n, alpha, m)
# returns m (x,y) coordinates of 1-alpha joint confidence ellipse of mean
{
x <- sin( 2* pi * (0 : (m - 1) )/ (m - 1)) # m points on a unit circle
y <- cos( 2* pi * (0 : (m - 1)) / (m - 1))
cv <- qchisq(1 - alpha, 2) # chisquared critical value
cv <- cv / n # value of quadratic form
for (i in 1 : m)
{
pair <- c(x[i], y[i]) # ith (x,y) pair
q <- pair %*% solve(s, pair) # quadratic form
x[i] <- x[i] * sqrt(cv / q) + xbar[1]
y[i] <- y[i] * sqrt(cv / q) + xbar[2]
}
return(cbind(x, y))
}
### pdf(file = "bivSwiss.pdf")
plot(biv, col = "red", pch = 16, cex.lab = 1.5)
lines(bivCI(var(biv), colMeans(biv), dim(biv)[1], .01, 1000), type = "l",
col = "blue")
lines(bivCI(var(biv), colMeans(biv), dim(biv)[1], .05, 1000),
type = "l", col = "green", lwd = 1)
lines(colMeans(biv)[1], colMeans(biv)[2], pch = 3, cex = .8, type = "p",
lwd = 1)
Created on 2021-03-15 by the reprex package (v0.3.0)
I am plotting curves for different distribution functions and I need to know the highest y-value for each curve. Later I will plot only the one curve, which is selected as the best fitting.
This is the function (it is a bit hard-coded, I am working on it):
library(plyr)
library(dplyr)
library(fitdistrplus)
library(evd)
library(gamlss)
fdistr <- function(d) {
# Uncomment to try run line by line
# d <- data_to_plot
TLT <- d$TLT
if (sum(TLT<=0)) {TLT[TLT<=0] <- 0.001} # replace values <= 0 so the log-based fits can be calculated
gev <- fgev(TLT, std.err=FALSE)
distr <- c('norm', 'lnorm', 'weibull', 'gamma')
fit <- lapply(X=distr, FUN=fitdist, data=TLT)
fit[[5]] <- gev
distr[5] <- 'gev'
names(fit) <- distr
Loglike <- sapply(X=fit, FUN=logLik)
Loglike_Best <- which(Loglike == max(Loglike))
# Uncomment to try run line by line
# max <- which.max(density(d$TLT)$y)
# max_density <- stats::density(d$TLT)$y[max]
# max_y <- max_density
x_data <- max(d$TLT)
hist(TLT, prob=TRUE, breaks= x_data,
main=paste(d$DLT_Code[1],
'- best :',
names(Loglike[Loglike_Best])),
sub = 'Total Lead Times',
col='lightgrey',
border='white'
# ylim= c(0,max_y)
)
lines(density(TLT),
col='darkgrey',
lty=2,
lwd=2)
grid(nx = NA, ny = NULL, col = "gray", lty = "dotted",
lwd = .5, equilogs = TRUE)
curve(dnorm(x,
mean=fit[['norm']]$estimate[1],
sd=fit[['norm']]$estimate[2]),
add=TRUE, col='blue', lwd=2)
curve(dlnorm(x,
meanlog=fit[['lnorm']]$estimate[1],
sdlog=fit[['lnorm']]$estimate[2]),
add=TRUE, col='darkgreen', lwd=2)
curve(dweibull(x,
shape=fit[['weibull']]$estimate[1],
scale=fit[['weibull']]$estimate[2]),
add=TRUE, col='purple', lwd=2)
curve(dgamma(x,
shape=fit[['gamma']]$estimate[1],
rate=fit[['gamma']]$estimate[2]),
add=TRUE, col='Gold', lwd=2)
curve(dgev(x,
loc=fit[['gev']]$estimate[1],
scale=fit[['gev']]$estimate[2],
shape=fit[['gev']]$estimate[3]),
add=TRUE, col='red', lwd=2)
legend_loglik <- paste(c('Norm', 'LogNorm', 'Weibull', 'Gamma','GEV'), c(':'),
round(Loglike, digits=2))
legend("topright", legend=legend_loglik,
col=c('blue', 'darkgreen', 'purple', 'gold', 'red'),
lty=1, lwd=2,
bty='o', bg='white', box.lty=2, box.lwd = 1, box.col='white')
# Loglike follows the order of distr: norm, lnorm, weibull, gamma, gev
return(data.frame(DLT_Code = d$DLT_Code[1],
n = length(d$TLT),
Best = names(Loglike[Loglike_Best]),
norm = Loglike[1],
lnorm = Loglike[2],
weibull = Loglike[3],
gamma = Loglike[4],
GEV = Loglike[5]))
}
# Creating data set
TLT <- c(rep(0,32), rep(1,120), rep(2,10), rep(3,67), rep(4,14), rep(5,7), 6)
DLT_Code <- c(rep('DLT_Code',251))
data_to_plot <- data.frame(cbind(DLT_Code,TLT))
data_to_plot$TLT <- as.numeric(as.character(data_to_plot$TLT ))
DLT_Distr <- do.call(rbind, by(data = data_to_plot, INDICES = data_to_plot$DLT_Code, FUN=fdistr))
I tried computing max_y and using it in ylim. That worked for the normal density, but not for the other curves.
Currently the plot looks like this (some curves are cut off):
If I set ylim = c(0,2), we can see that the lognormal and gamma densities go above 1:
I need to know the maximum value of each curve so that, once I choose which curve to plot, I can set the correct ylim.
You could use purrr::map_dbl to map optimize over your densities, provided you rearrange your code slightly and know the interval of input values over which to search for the maxima (i.e. where the densities are defined).
Define each density as a function with its parameters fixed ahead of time; you can then find the peak values with optimize and also pass the same functions to curve.
As a small reproducible example:
library(purrr)
# parameterize your densities
mynorm <- function(x) dnorm(x, mean = 0, sd = 1)
mygamma <- function(x) dgamma(x, rate = .5, shape = 1)
# get largest maximum over interval
ymax <- max(purrr::map_dbl(c(mynorm, mygamma), ~ optimize(., interval = c(0, 3), maximum = T)$objective))
# 0.4999811
# plot data
curve(mynorm, col = "blue", lwd = 2, xlim = c(0, 3), ylim = c(0, ymax * 1.1))
curve(mygamma, col = "red", lwd = 2, add = T)
Using your code, I've implemented the above solution and, following our discussion in the comments, increased the number of x grid points used by curve to show what you should actually be plotting:
library(plyr)
library(dplyr)
library(fitdistrplus)
library(evd)
library(gamlss)
library(purrr) # <- add this library
fdistr <- function(d) {
# Uncomment to try run line by line
# d <- data_to_plot
TLT <- d$TLT
if (sum(TLT<=0)) {TLT[TLT<=0] <- 0.001} # replace values <= 0 so the log-based fits can be calculated
gev <- fgev(TLT, std.err=FALSE)
distr <- c('norm', 'lnorm', 'weibull', 'gamma')
fit <- lapply(X=distr, FUN=fitdist, data=TLT)
fit[[5]] <- gev
distr[5] <- 'gev'
names(fit) <- distr
Loglike <- sapply(X=fit, FUN=logLik)
Loglike_Best <- which(Loglike == max(Loglike))
# Uncomment to try run line by line
# max <- which.max(density(d$TLT)$y)
# max_density <- stats::density(d$TLT)$y[max]
# max_y <- max_density
x_data <- max(d$TLT)
# parameterize your densities before plotting
mynorm <- function(x) {
dnorm(x,
mean=fit[['norm']]$estimate[1],
sd=fit[['norm']]$estimate[2])
}
mylnorm <- function(x){
dlnorm(x,
meanlog=fit[['lnorm']]$estimate[1],
sdlog=fit[['lnorm']]$estimate[2])
}
myweibull <- function(x) {
dweibull(x,
shape=fit[['weibull']]$estimate[1],
scale=fit[['weibull']]$estimate[2])
}
mygamma <- function(x) {
dgamma(x,
shape=fit[['gamma']]$estimate[1],
rate=fit[['gamma']]$estimate[2])
}
mygev <- function(x){
dgev(x,
loc=fit[['gev']]$estimate[1],
scale=fit[['gev']]$estimate[2],
shape=fit[['gev']]$estimate[3])
}
distributions <- c(mynorm, mylnorm, myweibull, mygamma, mygev)
# get the max of each density
y <- purrr::map_dbl(distributions, ~ optimize(., interval = c(0, x_data), maximum = T)$objective)
# find the max (excluding infinity)
ymax <- max(y[abs(y) < Inf])
hist(TLT, prob=TRUE, breaks= x_data,
main=paste(d$DLT_Code[1],
'- best :',
names(Loglike[Loglike_Best])),
sub = 'Total Lead Times',
col='lightgrey',
border='white',
ylim= c(0, ymax)
)
lines(density(TLT),
col='darkgrey',
lty=2,
lwd=2)
grid(nx = NA, ny = NULL, col = "gray", lty = "dotted",
lwd = .5, equilogs = TRUE)
curve(mynorm,
add=TRUE, col='blue', lwd=2, n = 1E5) # <- increase x grid
curve(mylnorm,
add=TRUE, col='darkgreen', lwd=2, n = 1E5) # <- increase x grid
curve(myweibull,
add=TRUE, col='purple', lwd=2, n = 1E5) # <- increase x grid
curve(mygamma,
add=TRUE, col='Gold', lwd=2, n = 1E5) # <- increase x grid
curve(mygev,
add=TRUE, col='red', lwd=2, n = 1E5) # <- increase x grid
legend_loglik <- paste(c('Norm', 'LogNorm', 'Weibull', 'Gamma','GEV'), c(':'),
round(Loglike, digits=2))
legend("topright", legend=legend_loglik,
col=c('blue', 'darkgreen', 'purple', 'gold', 'red'),
lty=1, lwd=2,
bty='o', bg='white', box.lty=2, box.lwd = 1, box.col='white')
# Loglike follows the order of distr: norm, lnorm, weibull, gamma, gev
return(data.frame(DLT_Code = d$DLT_Code[1],
n = length(d$TLT),
Best = names(Loglike[Loglike_Best]),
norm = Loglike[1],
lnorm = Loglike[2],
weibull = Loglike[3],
gamma = Loglike[4],
GEV = Loglike[5]))
}
# Creating data set
TLT <- c(rep(0,32), rep(1,120), rep(2,10), rep(3,67), rep(4,14), rep(5,7), 6)
DLT_Code <- c(rep('DLT_Code',251))
data_to_plot <- data.frame(cbind(DLT_Code,TLT))
data_to_plot$TLT <- as.numeric(as.character(data_to_plot$TLT ))
DLT_Distr <- do.call(rbind, by(data = data_to_plot, INDICES = data_to_plot$DLT_Code, FUN=fdistr))
Why your plot height isn't matching the solution output
To illustrate what's going on with your plot, and to clear up some possible confusion, you need to understand how curve plots your data. By default curve evaluates your function at 101 x-values and plots the resulting points as a line. Because the peaks of some of your densities are so sharp, curve isn't evaluating enough x-values to capture them. To show you what I mean I will focus on your gamma density. Don't worry too much about the code; focus on the output. Below are the first few (x, y) coordinates for different values of n.
library(purrr)
mygamma <- function(x) {
dgamma(x,
shape=fit[['gamma']]$estimate[1], # 0.6225622
rate=fit[['gamma']]$estimate[2]) # 0.3568242
}
number_of_x <- c(5, 10, 101, 75000)
purrr::imap_dfr(number_of_x, ~ curve(mygamma, xlim = c(0, 6), n = .), .id = "n") %>%
dplyr::mutate_at(1, ~ sprintf("n = %i", number_of_x[as.numeric(.)])) %>%
dplyr::mutate(n = factor(n, unique(n))) %>%
dplyr::filter(x > 0) %>%
dplyr::group_by(n) %>%
dplyr::slice_min(order_by = x, n = 5)
   n                 x       y
   <fct>         <dbl>   <dbl>
 1 n = 5     1.5        0.184
 2 n = 5     3          0.0828
 3 n = 5     4.5        0.0416
 4 n = 5     6          0.0219
 5 n = 10    0.667      0.336
 6 n = 10    1.33       0.204
 7 n = 10    2          0.138
 8 n = 10    2.67       0.0975
 9 n = 10    3.33       0.0707
10 n = 101   0.06       1.04
11 n = 101   0.12       0.780
12 n = 101   0.18       0.655
13 n = 101   0.24       0.575
14 n = 101   0.3        0.518
15 n = 75000 0.0000800 12.9
16 n = 75000 0.000160   9.90
17 n = 75000 0.000240   8.50
18 n = 75000 0.000320   7.62
19 n = 75000 0.000400   7.01
Notice that when n = 5 very few values are plotted. As n increases, the distance between the x-values gets smaller. Since these functions are continuous, there are infinitely many points to plot; that cannot be done computationally, so a subset of x-values is plotted as an approximation. The more x-values, the better the approximation. Normally the default n = 101 works fine, but because the gamma and log-normal densities have such sharp peaks, curve steps over the maximum value. Below is a full plot of the data for n = 5, 10, 101, 75000 with points added.
Finally I have used this solution, found here:
mygamma <- function(x) dgamma(x, shape=fit[['gamma']]$estimate[1],
rate=fit[['gamma']]$estimate[2])
get_curve_values <- function(fn, x_data){
res <- curve(fn, from=0, to=x_data)
dev.off()
res
}
curve_val <- get_curve_values(mygamma, x_data)
ylim <- max(curve_val$y,na.rm = TRUE)
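A variation on the same idea is to evaluate the density on a grid directly, which avoids opening and closing a graphics device just to read off the values. The sketch below is only an illustration: max_on_grid is a hypothetical helper, and the gamma parameters are hard-coded from the estimates quoted in the answer above (shape 0.6225622, rate 0.3568242) so the example runs on its own. Non-finite values are dropped because a gamma density with shape < 1 is infinite at x = 0.

```r
# hypothetical helper: largest finite value of fn on an evenly spaced grid
max_on_grid <- function(fn, from, to, n = 101) {
  x <- seq(from, to, length.out = n)
  y <- fn(x)
  max(y[is.finite(y)])
}

# gamma density with the estimates quoted earlier, hard-coded for illustration
mygamma <- function(x) dgamma(x, shape = 0.6225622, rate = 0.3568242)

# same grid that curve(mygamma, from = 0, to = 6) would use by default
ymax <- max_on_grid(mygamma, from = 0, to = 6)
ymax
```

Using the same n as the later curve call keeps ylim consistent with what is actually drawn; a finer grid gives a larger value because the density keeps rising towards 0.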
I want to model what it might look like to have the day length change smoothly over time (but remain sinusoidal). The formula for a "chirp", to change the instantaneous frequency is given at https://en.wikipedia.org/wiki/Chirp but it doesn't look right when coded for a 24h period over 5 days and then a transition to 12h over another 5 days:
period = list( c(24,24,5), c(24,12,5) )
alpha = list( c(0,5), c(0,5) )
s_samples = 100
A=50
O=50
simulatedData = data.frame(t=numeric(), v=numeric()) #initialise the output
daySteps = c(0, cumsum(unlist(period)[seq(3,length(unlist(period)), by=3)])) #set up the period starts and ends to set over, starting at 0
##Cycle over each of the items in the list
for(set in seq(period) ){
t_points = s_samples*period[[set]][3]
t = seq(daySteps[set], daySteps[set+1], length.out=t_points) #make the time
slope = (24/period[[set]][2]-24/period[[set]][1])/(max(t)-min(t)) # get the slope
f0 = 24/period[[set]][1] - slope*(min(t)) # find the freq when t0
c = (24/period[[set]][2]-f0)/(max(t)) #calculate the chirp see https://en.wikipedia.org/wiki/Chirp and https://dsp.stackexchange.com/questions/57904/chirp-after-t-seconds
wt = ((c*(t^2))/2) + f0*(t) # calc the freq
a = alpha[[set]][1]
v = A * cos(2*pi*wt - a) + O
simulatedData = rbind(simulatedData, data.frame(t, v) )
}
plot(simulatedData, type="l", lwd=2)
t = seq(0,sum(unlist(period)[seq(3,length(unlist(period)), by=3)]), by=1/24)
points(t, A*cos(2*pi*t)+O, col=3, type="l", lty=2)
points(t, A*cos(2*(24/12)*pi*t)+O, col=4, type="l", lty=2)
The first section (24h period) is perfect, as expected, and the last part of the second 5 days matches a 12h cycle, but the first part of that period looks 180deg out of phase. What's wrong?
I think you're making this a lot more complex than it needs to be. Remember that many R functions are already vectorized. The following function will produce a linear chirp between frequencies f0 and f1 between t0 and t1, with an optional phi parameter to specify at what point on the cycle you want your sequence to begin:
chirp <- function(f0, f1, t0, t1, phi = 0, n_steps = 1000)
{
C <- (f1 - f0)/(t1 - t0)
x <- seq(t0, t1, length.out = n_steps)
y <- sin(2 * pi * (C / 2 * (x - t0)^2 + f0 * (x - t0)) + phi) # Ref Wikipedia
data.frame(x, y)
}
Of course, it can also produce the static first half of your plot by "chirping" between two identical frequencies, so we can get a data frame of x, y points on the plot by doing
df <- rbind(chirp(1, 1, 0, 5), chirp(1, 2, 5, 10))
Which results in:
plot(df$x, df$y, type = "l")
Note that between 5 and 10 days there are 7.5 cycles, so if you wanted to smoothly continue frequency 2, you would need to set the phi parameter to a half cycle (i.e. to pi):
df <- rbind(df, chirp(2, 2, 10, 15, phi = pi))
plot(df$x, df$y, type = "l")
Note that the phases of the chirped signal and a 2 Hz signal will only match after n seconds if the chirp occurs over an even number of periods of the original signal. For an odd number, the phase will be out by 180 degrees. This is a mathematical consequence of a linear chirp. To see this, let's use our function to chirp over 6 seconds so the phases match at 10 seconds:
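# df, df2 and df3 are not defined in the post as captured; the next three lines are an assumed reconstruction
df <- rbind(chirp(1, 1, 0, 4), chirp(1, 2, 4, 10), chirp(2, 2, 10, 15))
df2 <- chirp(1, 1, 0, 15) # constant-frequency reference at the starting frequency
df3 <- chirp(2, 2, 0, 15) # constant-frequency reference at the final frequency
plot(df$x, df$y, type = "l")
lines(df2$x, df2$y, lty = 2, col = "green")
lines(df3$x, df3$y, lty = 2, col = "blue")
lines(df$x, df$y)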
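Another way to keep the signal continuous across segments, without working out phi by hand, is to define the instantaneous frequency as a function of time and integrate it numerically to get the phase. This is a different technique from the closed-form chirp above, sketched here under the assumption that the frequency profile is piecewise linear (1 cycle/day until day 5, ramping to 2 by day 10, then constant). The names tt, freq and cycles are illustrative.

```r
# time grid in days, 240 samples per day
tt <- seq(0, 15, by = 1 / 240)

# instantaneous frequency in cycles per day: 1 until day 5, ramp to 2 by day 10, then 2
freq <- approx(x = c(0, 5, 10, 15), y = c(1, 1, 2, 2), xout = tt)$y

# cumulative number of cycles: trapezoidal integral of the frequency over time
cycles <- c(0, cumsum(diff(tt) * (head(freq, -1) + tail(freq, -1)) / 2))

# phase-continuous signal
y <- sin(2 * pi * cycles)

plot(tt, y, type = "l", xlab = "days", ylab = "signal")
```

Because the phase is accumulated, the segments join without jumps. The 7.5 cycles completed during the ramp still leave the final section half a cycle away from a reference sin(2 * pi * 2 * tt), which matches the phi = pi adjustment described above.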
I have a number of inequalities that I would like to translate to points into a 3-d coordinate system, x, y and z.
I need to find all possible points that fulfill all my inequalities, save each point, and then plot it as a 3d-scatterplot.
Simplified data example:
#Inequalities (I have several more)
df <- data.frame(
x = c(0.5, 0.4, 0.1),
y = c(0.7, 0.11, -0.25),
z = c(-0.5, -0.02, 1),
v = c(90, 2500, 350))
# Limits of the coordinate system
x.lim <- seq(-100, 100, by = 1)
y.lim <- seq(-100, 100, by = 1)
z.lim <- seq(-50, 50, by = 1)
# Basic check - must be true for all points:
df$x + df$y + df$z < df$v
# Looping through all points of the coordinate system
# no need to test row 2 if the first row is false
df$x*-100 + df$y*-100 + df$z*-50 < df$v
# if all conditions are true, save the point to a list/matrix to be able to plot it
df$x*-99 + df$y*-100 + df$z*-50 < df$v
#...
df$x*-100 + df$y*-99 + df$z*-50 < df$v
df$x*-99 + df$y*-98 + df$z*-50 < df$v
#...
df$x*100 + df$y*100 + df$z*50 < df$v
So, in the end I would have a matrix m containing all the points for which every test is TRUE, which would look something like this:
m
x y z
-100 -100 -50
99 -100 -50
...
100 100 50
In the end I would be able to plot all my TRUE-TRUE-TRUE results into a 3d-scatterplot such as:
plot3d(x = m[, 1], y = m[, 2], z = m[, 3], col = "blue", size = 4,
xlim = c(-100, 100), ylim = c(-100, 100), zlim = c(-50, 50))
I potentially have many more inequalities and a much larger coordinate system, so speed is also an issue. My main problem is how to write a conditional loop that goes through all the available values and saves the points that pass into a matrix.
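One possible approach, sketched here under the assumption that each row of df encodes an inequality of the form x*X + y*Y + z*Z < v for a grid point (X, Y, Z): build all grid points once with expand.grid and test each inequality on whole columns, so no loop over individual points is needed. The names grid and keep are illustrative.

```r
# inequalities from the question, one per row: x*X + y*Y + z*Z < v
df <- data.frame(
  x = c(0.5, 0.4, 0.1),
  y = c(0.7, 0.11, -0.25),
  z = c(-0.5, -0.02, 1),
  v = c(90, 2500, 350))

# every point of the coordinate system, one row per point
grid <- expand.grid(x = seq(-100, 100, by = 1),
                    y = seq(-100, 100, by = 1),
                    z = seq(-50, 50, by = 1))

# start with all points and drop those failing any inequality
keep <- rep(TRUE, nrow(grid))
for (i in seq_len(nrow(df))) {
  keep <- keep &
    (df$x[i] * grid$x + df$y[i] * grid$y + df$z[i] * grid$z < df$v[i])
}

# matrix of the points that satisfy every inequality
m <- as.matrix(grid[keep, ])
nrow(m)
```

The loop runs once per inequality rather than once per point, so adding inequalities is cheap; memory grows with the number of grid points. The resulting m can be passed to the plot3d call from the question.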