I want to calculate the integral of the Normal Distribution at exactly some point. I know that one way to approximate this is to evaluate the Normal CDF at that point and at a point slightly after it, and then subtract the two values to get an approximate answer.
I tried doing this in R:
a = pnorm(1.96, mean = 0, sd = 1, log = FALSE)
b = pnorm(1.961, mean = 0, sd = 1, log = FALSE)
final_answer = b - a
#5.83837e-05
Is it possible to do this in one step instead of manually subtracting "a" and "b"?
Thank you!
We need to be clear about what you are asking here. If you are looking for the integral of the normal density up to a specific point, then you can use pnorm, which is the anti-derivative of dnorm.
We can see this by reversing the process and looking at the derivative of pnorm to ensure it matches dnorm:
# Numerical approximation to derivative of pnorm:
delta <- 10^-6
(pnorm(0.75 + delta) - pnorm(0.75)) / delta
#> [1] 0.3011373
Note that this is a very close approximation of dnorm:
dnorm(0.75)
#> [1] 0.3011374
So the anti-derivative of a normal distribution density at point x is given by:
pnorm(x)
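The relationship can also be checked in the other direction (a quick check that is not part of the original answer): numerically integrating dnorm up to a point recovers pnorm at that point.
# Numerical integral of dnorm from -Inf up to 0.75:
integrate(dnorm, lower = -Inf, upper = 0.75)$value
#> [1] 0.7733726
pnorm(0.75)
#> [1] 0.7733726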
You can try this
> diff(pnorm(c(1.96, 1.961), mean = 0, sd = 1, log = FALSE))
[1] 5.83837e-05
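Relatedly (not part of the original answer), because the difference of the CDF over a small interval is approximately the density times the interval width, you can get essentially the same number directly from dnorm:
# density at 1.96 times the interval width 0.001:
dnorm(1.96) * 0.001
# roughly 5.84e-05, very close to diff(pnorm(c(1.96, 1.961)))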
It is the case that the probability density for a standardized and unstandardized random variable will differ. E.g., in R
dnorm(x = 0, mean = 1, sd = 2)
dnorm(x = (0 - 1)/2)
However,
pnorm(q = 0, mean = 1, sd = 2)
pnorm(q = (0 - 1)/2)
yields the same value.
Are there any situations in which the Normal cumulative distribution function will yield a different probability for the same random variable when it is standardized versus unstandardized? If yes, is there a particular example in which this difference arises? If not, is there a general proof of this property?
Thanks so much for any help and/or insight!
This isn't really a coding question, but I'll answer it anyway.
Short answer: yes, they may differ.
Long answer:
A normal distribution is usually thought of as y = f(x), that is, a curve over the domain of x. When you standardize, you are converting from units of x to units of z. For example, if x ~ N(15, 5^2), then a value of 10 is 5 x-units less than the mean. Notice that this is also 1 standard deviation less than the mean. When you standardize, you convert x to z ~ N(0, 1^2). Now, that example value of 10, when standardized into z-units, becomes a value of -1 (i.e., it's still one standard deviation less than the mean).
As a result, the area under the curve to the left of x=10 is the same as the area under the curve to the left of z=-1. In words, the cumulative probability up to those cut-offs is the same.
However, the heights of the curves are different. Let the normal distribution curves be f(x) and g(z). Then f(10) != g(-1). In code:
dnorm(10, 15, 5) != dnorm(-1, 0, 1)
The reason is that the act of standardizing either "spreads" or "squishes" the f(x) curve to make it "fit" over the new z domain as g(z).
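A quick numerical check of both claims (not in the original answer): the cumulative probabilities match, but the density of the unstandardized curve is the standardized density divided by the standard deviation.
pnorm(10, mean = 15, sd = 5)   # area to the left of x = 10
pnorm(-1)                      # area to the left of z = -1: same value (~0.1587)
dnorm(10, mean = 15, sd = 5)   # height of f(x) at x = 10
dnorm(-1) / 5                  # height of g(z) at z = -1, rescaled by 1/sd: same value (~0.0484)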
Here are two links that let you visualize the spreading/squishing:
https://academo.org/demos/gaussian-distribution/
https://www.intmath.com/counting-probability/normal-distribution-graph-interactive.php
Hope this helps!
I have no sample, and I'd like to compute the variance, mean, median, and mode of a distribution for which I only have a vector of its density values and a vector of its support. Is there an easy way to compute these statistics in R from this information?
Suppose that I only have the following information:
Support
Density
sum(Density) == 1 #TRUE
length(Support)==length(Density)# TRUE
You have to do weighted summations
For example, starting with @Johann's example:
set.seed(312345)
x = rnorm(1000, mean=10, sd=1)
x_support = density(x)$x
x_density = density(x)$y
plot(x_support, x_density)
mean(x)
prints
[1] 10.00558
and what, I believe, you're looking for
m = weighted.mean(x_support, x_density)
which computes the mean as a weighted mean of the support values, producing the output
10.0055796130192
Contributed packages provide weighted-variance and weighted-quantile helpers (for example, Hmisc::wtd.var and Hmisc::wtd.quantile) which should help you with the other quantities you're looking for.
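If you want to work directly with the Support and Density vectors from the question rather than a kernel density of a sample, here is a minimal sketch of the weighted computations (assuming the two vectors are as described in the question; the weights are re-normalized in case they don't sum exactly to 1):
w  <- Density / sum(Density)                 # normalized weights
mu <- sum(Support * w)                       # mean
v  <- sum((Support - mu)^2 * w)              # variance
med <- Support[which(cumsum(w) >= 0.5)[1]]   # median: first point where the cumulative weight reaches 0.5
mo  <- Support[which.max(Density)]           # mode: support point with the highest density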
If you don't need a mathematical solution, and an empirical one is all right, you can achieve a pretty good approximation by sampling.
Let's generate some data:
set.seed(6854684)
x = rnorm(50,mean=10,sd=1)
x_support = density(x)$x
x_density = density(x)$y
# see our example:
plot(x_support, x_density )
# the real mean of x
mean(x)
Now to 'reverse' the process we generate a large sample from that density distribution:
x_sampled = sample(x = x_support, 1000000, replace = T, prob = x_density)
# get the statistics
mean(x_sampled)
median(x_sampled)
var(x_sampled)
etc...
Mathematically, the following results should be impossible:
library(truncdist)
q = function(x, L, R ) dtrunc(x, "exp", rate=0.1, a=L,b=R)
integrate(q, L=2, R=3, lower =0, upper = 27 )
integrate(q, L=2, R=3, lower =0, upper = 29 )
integrate(q, L=2, R=3, lower =27, upper = 29 )
integrate(q, L=2, R=3, lower =0, upper = 30 )
The first integral gives a positive number, yet the second evaluates to zero, even though it only adds the third interval, which itself integrates to zero. Is this an issue in integrate or in truncdist?
We can use the following to find more such issues
z = numeric()
for (i in 1:50) {
  z[i] = integrate(q, L = 2, R = 3, lower = 0, upper = i)$value
}
What do I need to do to obtain the correct integrals (which should all equal 1 when integrating from 0 to any i >= 3)?
From help("integrate"):
Like all numerical integration routines, these evaluate the function on a finite set of points. If the function is approximately constant (in particular, zero) over nearly all its range it is possible that the result and error estimate may be seriously wrong.
You found an example of this:
curve(q(x, 2, 3), from = -1, to = 30)
You shouldn't integrate distribution density functions numerically. Use the cumulative distribution function:
diff(ptrunc(c(0, 29), "exp", rate = 0.1, a = 2, b = 3))
#[1] 1
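If you do need integrate() itself, one workaround (a sketch, not part of the original answer) is to keep the integration limits on the support of the truncated density, so the integrand is not nearly-everywhere zero over the range being integrated:
# dtrunc is zero outside [2, 3], so integrate over (or split the range at) that support:
integrate(q, L = 2, R = 3, lower = 2, upper = 3)
# approximately 1 with a small absolute error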
I have found an alternative answer in this post: Integration in R with integrate function. Using hcubature (from the cubature package), the problem can be solved numerically, which is closer to my original question.
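For reference, a sketch of that approach (assuming the cubature package is installed; the claim that hcubature handles this integrand comes from the linked post):
library(cubature)
library(truncdist)
f <- function(x) dtrunc(x, "exp", rate = 0.1, a = 2, b = 3)
hcubature(f, lowerLimit = 0, upperLimit = 29)$integral
# should come out close to 1, per the linked post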
library(mvtnorm)
dmvnorm(x, mean = rep(0, p), sigma = diag(p), log = FALSE)
The dmvnorm provides the density function for a multivariate normal distribution. What exactly does the first parameter, x represent? The documentation says "vector or matrix of quantiles. If x is a matrix, each row is taken to be a quantile."
> dmvnorm(x=c(0,0), mean=c(1,1))
[1] 0.0585
Here is the sample code from the help page. In that case, are you generating the probability density at quantile (0, 0) under a normal distribution with mean (1, 1) and sd 1 (assuming that's the default)? Since this is a multivariate normal density function, and a vector of quantiles (0, 0) was passed in, why isn't the output a vector of probabilities?
Taking a bivariate normal (X1, X2) as an example: by passing in x = (0, 0), you get the joint density evaluated at (X1 = 0, X2 = 0), which is a single value. Why would you expect a vector?
If you want a vector, you need to pass in a matrix. For example, x = cbind(c(0, 1), c(0, 1)) gives the densities at (X1 = 0, X2 = 0) and (X1 = 1, X2 = 1). In this situation, each row of the matrix is evaluated separately, so you get one density value per row.
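A small check (not part of the original answer) illustrating the matrix form, assuming the mvtnorm package is installed; with the default identity covariance, each row's density equals the product of the univariate densities.
library(mvtnorm)
x <- rbind(c(0, 0), c(1, 1))              # two quantile points, one per row
dmvnorm(x, mean = c(1, 1))                # returns a vector of two densities
dnorm(0, mean = 1) * dnorm(0, mean = 1)   # matches the first element (~0.0585)
dnorm(1, mean = 1) * dnorm(1, mean = 1)   # matches the second element (~0.1592)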
I have a question regarding the std and t definitions of the Student's t-distribution in R. std comes with the rugarch package, while t is from the stats package. When plotting
qqplot(qstd(c(1:1000)/1001, nu=5),qt(c(1:1000)/1001,df=5))
abline(0,1)
it is clear to see that the two definitions are different. Can anyone tell me why there is this difference and which one of the functions gives the correct values?
If you open up the qstd function (which is actually from fGarch), you'll see that it's modifying stats::qt:
> qstd
function (p, mean = 0, sd = 1, nu = 5)
{
    s = sqrt(nu/(nu - 2))
    result = qt(p = p, df = nu) * sd/s + mean
    result
}
<environment: namespace:fGarch>
So what it's giving you is a "non-standardized" (location-scale) version of the Student's t-distribution: with the defaults mean = 0 and sd = 1 it is rescaled by 1/s so that it has unit variance, as opposed to the standard Student's t from stats, whose variance is nu/(nu - 2). Neither is incorrect; they are different parameterizations. With sd = 1, the two produce the same result only in the limit as the degrees of freedom go to infinity, where s = sqrt(nu/(nu - 2)) approaches 1.
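A quick check of the relationship (a sketch, assuming the fGarch package is installed): with the defaults mean = 0 and sd = 1, qstd is just qt rescaled by 1/s.
library(fGarch)
p <- c(0.025, 0.5, 0.975)
s <- sqrt(5 / (5 - 2))
all.equal(qstd(p, nu = 5), qt(p, df = 5) / s)
# [1] TRUE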