R: what is the vector of quantiles in density function dmvnorm

library(mvtnorm)
dmvnorm(x, mean = rep(0, p), sigma = diag(p), log = FALSE)
The dmvnorm provides the density function for a multivariate normal distribution. What exactly does the first parameter, x represent? The documentation says "vector or matrix of quantiles. If x is a matrix, each row is taken to be a quantile."
> dmvnorm(x=c(0,0), mean=c(1,1))
[1] 0.0585
Here is the sample code on the help page. In that case, are you generating the probability of having quantile 0 under a normal distribution with mean 1 and sd 1 (assuming that's the default)? Since this is a multivariate normal density function, and a vector of quantiles (0, 0) was passed in, why isn't the output a vector of probabilities?

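Taking a bivariate normal (X1, X2) as an example: by passing in x = (0, 0), you get the joint density f(0, 0), i.e. the density evaluated at the single point X1 = 0, X2 = 0, which is a single value. (For a continuous distribution this is a density, not the probability P(X1 = 0, X2 = 0), which is zero.) Why do you expect to get a vector?
If you want a vector, you need to pass in a matrix. For example, x = cbind(c(0,1), c(0,1)) gives
f(0, 0)
f(1, 1)
In this situation, each row of the matrix is treated as a separate point, and one density value is returned per row.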
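To make the row-by-row behaviour concrete, here is a minimal base-R sketch of the multivariate normal density. The helper name mvn_density is made up for illustration and is not part of mvtnorm; it only mirrors what dmvnorm is documented to compute for a vector or a matrix of points.

```r
# Hypothetical helper (not part of mvtnorm): multivariate normal density,
# evaluated at one point (a vector) or at several points (rows of a matrix).
mvn_density <- function(x, mean, sigma) {
  x <- matrix(x, ncol = length(mean))   # a vector becomes a one-row matrix
  centred <- sweep(x, 2, mean)          # subtract the mean from every row
  # squared Mahalanobis distance of each row from the mean
  maha <- rowSums((centred %*% solve(sigma)) * centred)
  exp(-maha / 2) / sqrt((2 * pi)^length(mean) * det(sigma))
}

single <- mvn_density(c(0, 0), mean = c(1, 1), sigma = diag(2))
several <- mvn_density(cbind(c(0, 1), c(0, 1)), mean = c(1, 1), sigma = diag(2))
print(single)   # one density value, for the point (0, 0)
print(several)  # two density values, for the points (0, 0) and (1, 1)
```

With mvtnorm installed, dmvnorm(cbind(c(0,1), c(0,1)), mean = c(1,1)) should give the same two values.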

What does the output of the function mvrnorm of MASS mean?

Using the mvrnorm() from the MASS package, now we can simulate realizations of multivariate normal distributions. This function works as follows:
library(MASS)
MASS::mvrnorm(
n = 10, # Number of realizations,
mu = c(1, 5), # Parameter vector mu,
Sigma = my_cov_matrix(1, 3, 0.2) # Parameter matrix Sigma
)
What does this output mean? Why are there two columns with ten random variables each?
The task is as follows:
I am supposed to create a function my_mvrnorm(n, mu_1, mu_2, sigma_1, sigma_2, rho), which simulates n realizations of the corresponding multivariate normal distribution, depending on mu and the matrix Sigma, and stores them in a tibble with the column names X and Y. In addition, this tibble is to contain a third column rho, in which all entries are filled with rho.
This should look like the following then:
But I couldn't write the function yet, because I don't quite understand what the values in the columns X and Y should be. Can someone help me?
Attempt:
my_mvrnorm <- function(n, mu_1, mu_2, sigma_1, sigma_2, rho){
mu = c(mu_1, mu_2)
sigma = my_cov_matrix(sigma_1, sigma_2, rho)
tb <- tibble(
X = ,
Y = ,
rho = rep(rho, n)
)
return(tb)
}
The n = 10 specification requests 10 samples. The mu = c(1, 5) specification gives two means, one per dimension. So, you get a 10 x 2 matrix as the result: each row is one draw, and each column is one component of the bivariate normal. If you check, the first column has a mean close to 1, and the second a mean close to 5. Is my_cov_matrix defined somewhere else?
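One possible way to fill in the attempt is sketched below. Since my_cov_matrix is not shown in the question, the definition here is an assumption: it treats sigma_1 and sigma_2 as standard deviations and rho as the correlation. A data.frame is used to keep the sketch dependency-free; tibble::as_tibble() could wrap the result if a tibble is required.

```r
library(MASS)  # for mvrnorm

# Assumed definition (the question does not show it): covariance matrix built
# from two standard deviations and a correlation.
my_cov_matrix <- function(sigma_1, sigma_2, rho) {
  matrix(c(sigma_1^2, rho * sigma_1 * sigma_2,
           rho * sigma_1 * sigma_2, sigma_2^2), nrow = 2)
}

my_mvrnorm <- function(n, mu_1, mu_2, sigma_1, sigma_2, rho) {
  draws <- mvrnorm(n, mu = c(mu_1, mu_2),
                   Sigma = my_cov_matrix(sigma_1, sigma_2, rho))
  draws <- matrix(draws, ncol = 2)  # mvrnorm returns a plain vector when n = 1
  # column 1 holds the draws of the first component, column 2 of the second
  data.frame(X = draws[, 1], Y = draws[, 2], rho = rep(rho, n))
}

set.seed(1)
sim <- my_mvrnorm(10, mu_1 = 1, mu_2 = 5, sigma_1 = 1, sigma_2 = 3, rho = 0.2)
print(sim)
```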

Monte Carlo simulation from a pdf using runif

I'm given a pdf for X where f(x) = 2x when x is between 0 and 1, and f(x) = 0 otherwise. In class we learned to sample from a uniform distribution and transform the data to solve for y, however, I'm unsure how to apply that here because if I generate data from a uniform distribution then most of it will be between 0 and 1.
Am I doing these steps in the wrong order? It just seems weird to have a PDF that will lead to most of the data just being multiplied by 2.
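I will use R's convention of naming PDFs with an initial d, CDFs with an initial p and quantile functions with an initial q.
It is very simple. Compute the antiderivative of dmydist(x) = 2*x to get the CDF pmydist(x) = x^2, then invert it to get the quantile function qmydist(y) = sqrt(y). The associated RNG is immediate: apply the quantile function to uniform draws.
dmydist <- function(x) {
ifelse(x >= 0 & x <= 1, 2*x, 0)
}
# CDF: 0 below the support, x^2 on [0, 1], 1 above it
pmydist <- function(x) {
ifelse(x < 0, 0, ifelse(x > 1, 1, x^2))
}
# quantile function (inverse CDF), defined for probabilities in [0, 1]
qmydist <- function(y) {
ifelse(y >= 0 & y <= 1, sqrt(y), NaN)
}
rmydist <- function(n) qmydist(runif(n))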
set.seed(1234)
x <- rmydist(10000)
hist(x, prob = TRUE)
lines(seq(0, 1, by = 0.01), dmydist(seq(0, 1, by = 0.01)))
There are many ways to do this. One way could be rejection sampling https://en.wikipedia.org/wiki/Rejection_sampling. Simply put:
1. Sample a point on the x-axis from the proposal distribution.
2. Draw a vertical line at this x-position, up to the curve of the proposal distribution.
3. Sample uniformly along this line from 0 to the maximum of the probability density function. If the sampled value is greater than the value of the desired distribution at this vertical line, return to step 1.
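n <- 1e5
x <- runif(n)       # proposals from Uniform(0, 1)
u <- 2 * runif(n)   # heights, uniform between 0 and the maximum of f, which is 2
hist(x[u < 2 * x])  # keep the proposals whose height falls below f(x) = 2x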
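As a quick sanity check of both approaches, the sketch below draws from f(x) = 2x by inverse transform and by rejection sampling and compares the sample means with the theoretical mean E[X] = 2/3. The variable names are illustrative only, and the seed is arbitrary.

```r
set.seed(42)
n <- 1e5

# Inverse transform: the CDF is F(x) = x^2 on [0, 1], so F^{-1}(u) = sqrt(u)
inv_sample <- sqrt(runif(n))

# Rejection sampling with a Uniform(0, 1) proposal and envelope height 2
proposal <- runif(n)
height <- 2 * runif(n)
rej_sample <- proposal[height < 2 * proposal]

mean(inv_sample)        # should be close to 2/3
mean(rej_sample)        # should also be close to 2/3
length(rej_sample) / n  # acceptance rate, should be close to 1/2
```

The acceptance rate of about one half is the price of rejection sampling here: the area under f is 1, while the envelope rectangle has area 2.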

How to draw an $\alpha$ confidence areas on a 2D-plot?

There are a lot of answers regarding plotting confidence intervals.
I'm reading the paper by Lourme and Maurer (2016) and I'd like to draw the 90% confidence boundary and the 10% exceptional points as in Fig. 2 of the paper.
I can't use LaTeX here, so I inserted the definition of confidence areas as a picture.
library("MASS")
library(copula)
set.seed(612)
n <- 1000 # length of sample
d <- 2 # dimension
# random vector with uniform margins on (0,1)
u1 <- runif(n, min = 0, max = 1)
u2 <- runif(n, min = 0, max = 1)
u = matrix(c(u1, u2), ncol=d)
Rg <- cor(u) # d-by-d correlation matrix
Rg1 <- ginv(Rg) # inv. matrix
# round(Rg %*% Rg1, 8) # check
# the multivariate c.d.f of u is a Gaussian copula
# with parameter Rg[1,2]=0.02876654
normal.cop = normalCopula(Rg[1,2], dim=d)
fit.cop = fitCopula(normal.cop, u, method="itau") #fitting
# Rg.hat = fit.cop@estimate[1]
# [1] 0.03097071
sim = rCopula(n, normal.cop) # in (0,1)
# Taking the quantile function of N1(0, 1)
y1 <- qnorm(sim[,1], mean = 0, sd = 1)
y2 <- qnorm(sim[,2], mean = 0, sd = 1)
par(mfrow=c(2,2))
plot(y1, y2, col="red"); abline(v=mean(y1), h=mean(y2))
plot(sim[,1], sim[,2], col="blue")
hist(y1); hist(y2)
Reference.
Lourme, A., F. Maurer (2016) Testing the Gaussian and Student's t copulas in a risk management framework. Economic Modelling.
Question. Could anyone help me and give the explanation of the variable v=(v_1,...,v_d) and G(v_1),..., G(v_d) in the equation?
I think v is the non-random matrix, the dimensions should be $k^2$ (grid points) by d=2 (dimensions). For example,
axis_x <- seq(0, 1, 0.1) # 11 grid points
axis_y <- seq(0, 1, 0.1) # 11 grid points
v <- expand.grid(axis_x, axis_y)
plot(v, type = "p")
So, your question is about the vector nu and the corresponding G(nu).
nu is a simple random vector drawn from any distribution that has domain (0,1). (Here I use the uniform distribution.) Since you want your samples in 2D, one single nu can be nu = runif(2). Given the explanations above, G is the Gaussian quantile function with mean 0 and sd 1 (qnorm in R), applied to each component of nu, and Rg is the correlation matrix of the copula (Rg has dimensions 2x2 in 2D).
Now what the paragraph says: if you have a random sample nu and you want it to be drawn from Gamma given the number of dimensions d and confidence level alpha, then you need to compute the statistic (G(nu) %*% Rg^-1) %*% G(nu) and check that it is below the alpha quantile of the Chi^2 distribution with d degrees of freedom.
For example:
# This is the copula parameter
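# (a correlation matrix must be symmetric, so both off-diagonal entries share one value)
rho <- runif(1)
Rg <- matrix(c(1, rho, rho, 1), ncol = 2)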
# But we need to compute the inverse for sampling
Rginv <- MASS::ginv(Rg)
sampleResult <- replicate(10000, {
# we draw our nu from uniform, but others that map to (0,1), e.g. beta, are possible, too
nu <- runif(2)
# we compute G(nu), the standard Gaussian quantile function (inverse cdf) applied to the sample
Gnu <- qnorm(nu, mean = 0, sd = 1)
# for this we compute the statistic as given in formula
stat <- (Gnu %*% Rginv) %*% Gnu
# and return the result
list(nu = nu, Gnu = Gnu, stat = stat)
})
theSamples <- sapply(sampleResult["nu",], identity)
# this is the critical value of the Chi^2 with alpha = 0.95 and df = number of dimensions
# old and buggy threshold <- pchisq(0.95, df = 2)
# new and awesome - we are looking for the statistic at alpha = .95 quantile
threshold <- qchisq(0.95, df = 2)
# we can accept samples given the threshold (like in equation)
inArea <- sapply(sampleResult["stat",], identity) < threshold
plot(t(theSamples), col = as.integer(inArea)+1)
The red points are the points you would keep (I plot all points here).
As for drawing the decision boundaries, I think it is a little bit more complicated, since you need to compute the exact pairs of nu for which (Gnu %*% Rginv) %*% Gnu == qchisq(alpha, df = 2). This is a quadratic equation in Gnu (its solutions form an ellipse); you solve it for Gnu and then apply the inverse of qnorm, i.e. pnorm, to get your nu at the decision boundaries.
edit: Reading the paragraph again, I noticed, the parameter for Gnu does not change, it is simply Gnu <- qnorm(nu, mean = 0, sd = 1).
edit: There was a bug: for threshold you need to use the quantile function qchisq instead of the distribution function pchisq - now corrected in the code above (and updated the figures).
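One way to trace the boundary directly, sketched here under assumptions rather than taken from the paper: in Gaussian space the boundary is the ellipse where the quadratic form equals the Chi^2 quantile, so it can be parametrized with a Cholesky factor of Rg and mapped back to (0,1)^2 with pnorm. The correlation value 0.5 is hypothetical and chosen only for illustration.

```r
# Hypothetical correlation, chosen only for illustration
rho <- 0.5
Rg <- matrix(c(1, rho, rho, 1), ncol = 2)
threshold <- qchisq(0.95, df = 2)

# Points on the unit circle, one per column
theta <- seq(0, 2 * pi, length.out = 400)
circle <- rbind(cos(theta), sin(theta))

# chol() returns an upper-triangular R with t(R) %*% R == Rg,
# so L = t(R) satisfies L %*% t(L) == Rg
L <- t(chol(Rg))

# Ellipse in Gaussian space: every column z satisfies z' Rg^-1 z == threshold
z <- sqrt(threshold) * (L %*% circle)

# Map back to (0, 1)^2 with the standard normal cdf, the inverse of qnorm
boundary <- pnorm(t(z))
plot(boundary, type = "l", xlim = c(0, 1), ylim = c(0, 1),
     xlab = "nu1", ylab = "nu2")
```

Points whose statistic is below the threshold fall inside this closed curve.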
This has two parts: first, compute the copula value as a function of X and Y; then, plot the curve giving the boundary where the copula exceeds the threshold.
Computing the value is basically linear algebra which @drey has answered. This is a rewritten version so that the copula is given by a function.
cop1 <- function(x)
{
Gnu <- qnorm(x)
Gnu %*% Rginv %*% Gnu
}
copula <- function(x)
{
apply(x, 1, cop1)
}
Plotting the boundary curve can be done using the same method as here (which in turn is the method used by the textbooks Modern Applied Stats with S, and Elements of Stat Learning). Create a grid of values, and use interpolation to find the contour line at the given height.
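# a correlation matrix must be symmetric, so both off-diagonal entries share one value
rho <- runif(1)
Rg <- matrix(c(1, rho, rho, 1), ncol = 2)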
Rginv <- MASS::ginv(Rg)
# draw the contour line where value == threshold
# define a grid of values first: avoid x and y = 0 and 1, where infinities exist
xlim <- 1e-3
delta <- 1e-3
xseq <- seq(xlim, 1-xlim, by=delta)
grid <- expand.grid(x=xseq, y=xseq)
prob.grid <- copula(grid)
threshold <- qchisq(0.95, df=2)
contour(x=xseq, y=xseq, z=matrix(prob.grid, nrow=length(xseq)), levels=threshold,
col="grey", drawlabels=FALSE, lwd=2)
# add some points
data <- data.frame(x=runif(1000), y=runif(1000))
points(data, col=ifelse(copula(data) < threshold, "red", "black"))

dmultinom function for Multinomial distribution R

The function dmultinom(x, size = NULL, prob, log = FALSE) computes probabilities of a Multinomial distribution. However, it does not run with size = 1.
Theoretically, when setting size = 1 the Multinomial distribution should be equivalent to the Categorical distribution.
Does anybody know why I get the error message?
FYI, Categorical distribution can be modelled by dist.Categorical {LaplacesDemon}.
Examples:
dmultinom(c(1,2,1),size = 1,prob = c(0.3,0.5,0.4))
Error in dmultinom(c(1, 2, 1), size = 1, prob = c(0.3, 0.5, 0.4)) :
size != sum(x)
dcat(c(1,2,1),p = c(0.3,0.5,0.4))
[1] 0.3 0.5 0.3
Thanks
LaplacesDemon::dcat and stats::dmultinom do two different things. If you have multiple observations dcat takes a vector of category values, whereas dmultinom takes a single vector response, so you have to construct a matrix of responses and use apply (or something).
library(LaplacesDemon)
probs <- c(0.3,0.5,0.2)
dcat(c(1,2,1), p = probs) ## ans: 0.3 0.5 0.3
x=matrix(c(1,0,0,
0,1,0,
1,0,0),
nrow=3,byrow=TRUE)
apply(x,1,dmultinom,size=1, prob=probs)
(I modified your example because your original probabilities, c(0.3,0.5,0.4), don't add up to 1 - neither function gives you a warning, but dmultinom automatically rescales the probabilities to sum to 1)
If I try dmultinom(c(1,2,1),p=probs, size=1) I get
size != sum(x)
that is, dmultinom is interpreting c(1,2,1) as "one sample from group 1, two samples from group 2, 1 from group 3", which isn't consistent with a total sample size of 1 ...
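The conversion between the two conventions can be sketched in base R: each category label becomes a one-hot count vector, which is what dmultinom expects when size = 1. The variable names are illustrative.

```r
probs <- c(0.3, 0.5, 0.2)
categories <- c(1, 2, 1)  # category labels, as dcat would take them

# Row i of the identity matrix is the one-hot count vector for category i
one_hot <- diag(length(probs))[categories, , drop = FALSE]

# One multinomial probability per observation (per row)
cat_probs <- apply(one_hot, 1, dmultinom, size = 1, prob = probs)
print(cat_probs)  # should match dcat(categories, p = probs)
```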

stochastic matrix stationary distribution

I have a large right stochastic matrix (rows sum to 1), of size about 20000x20000. How can I find its stationary distribution?
I tried to calculate the eigenvalues and eigenvectors, and got complex eigenvalues, e.g. 1+0i (more than one).
I also tried the following method:
pi=u[I-P+U]^-1
but when I do the inversion with solve() I get the error message Error in solve.default(A): system is computationally singular: reciprocal condition number = 3.16663e-19
As far as I understand, the Perron–Frobenius theorem ensures that every stochastic matrix has a stationary probability vector pi and that the largest absolute value of an eigenvalue is always 1, so pi = pi P. Since my matrix has all positive entries, I should get a unique pi. Am I correct?
Or is there any other method I can use to calculate the row vector pi?
Every stochastic matrix indeed has a stationary distribution. Since P has all row sums = 1,
(P-I) has row sums = 0 => (P-I)*(1, ...., 1) always gives you zero. So rank(P-I) <= n-1, and the same holds for the rank of its transpose. Hence, there exists a nonzero q such that (t(P)-I)*q = 0 => t(P)q = q.
Complex value 1+0i seems to be quite real for me. But if you get only complex values, i.e. coefficient before i is not 0, then the algorithm produces an error somewhere -- it solves the problem numerically and does not have to be true all the time. Also it does not matter how many eigenvalues and vectors it produces, what matters is that it finds the right eigenvector for eigenvalue 1 and that's what you need.
Make sure that your stationary distribution is indeed your limit distribution, otherwise there is no point in computing it. You could try to find it out by multiplying different vectors with your matrix^1000, but I don't know how much time it will take in your case.
Last but not least, here is an example:
# first we need a function that calculates matrix^n
mpot = function (A, p) {
# calculates A^p (matrix multiplied p times with itself)
# inputs: A - real-valued square matrix, p - natural number.
# output: A^p
B = A
if (p>1)
for (i in 2:p)
B = B%*%A
return (B)
}
# example matrix
P = matrix( nrow = 3, ncol = 3, byrow = T,
data = c(
0.1, 0.9, 0,
0.4, 0, 0.6,
0, 1, 0
)
)
# this converges to stationary distribution independent of start distribution
t(mpot(P,1000)) %*% c(1/3, 1/3, 1/3)
t(mpot(P,1000)) %*% c(1, 0, 0)
# is it stationary?
xx = t(mpot(P,1000)) %*% c(1, 0, 0)
t(P) %*% xx
# find stationary distribution using eigenvalues
eigen(t(P)) # here it is!
eigen_vect = eigen(t(P))$vectors[,1]
stat_dist = eigen_vect/sum(eigen_vect) # as there is subspace of them,
# but we need the one with sum = 1
stat_dist
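For a matrix as large as 20000x20000, forming P^1000 or a full eigendecomposition may be too expensive. A cheaper alternative, sketched here as an assumption-laden illustration rather than a tuned implementation, is power iteration on a single row vector: repeat pi <- pi %*% P until it stops changing. This assumes the chain is irreducible and aperiodic, which holds when all entries are positive. The function name stationary_power and its defaults are made up for this example.

```r
# Hypothetical helper: stationary distribution by power iteration on a vector.
# Each step costs one vector-matrix product instead of a matrix-matrix product.
stationary_power <- function(P, tol = 1e-12, max_iter = 10000) {
  pi_vec <- rep(1 / nrow(P), nrow(P))      # start from the uniform distribution
  for (i in seq_len(max_iter)) {
    pi_next <- as.numeric(pi_vec %*% P)    # one step of the chain
    if (max(abs(pi_next - pi_vec)) < tol) break
    pi_vec <- pi_next
  }
  pi_next / sum(pi_next)                   # renormalize against rounding drift
}

# Same example matrix as above
P <- matrix(c(0.1, 0.9, 0,
              0.4, 0,   0.6,
              0,   1,   0), nrow = 3, byrow = TRUE)
pi_hat <- stationary_power(P)
print(pi_hat)
print(as.numeric(pi_hat %*% P))  # should reproduce pi_hat
```

The number of iterations needed depends on the second-largest eigenvalue modulus of P, so slowly mixing chains may need a larger max_iter.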
