Why doesn't "dnorm" sum up to one as a probability? - r

This may be a basic/fundamental question about the 'dnorm' function in R. Let's say I create some z-scores through a z transformation and try to sum the output of 'dnorm'.
data=c(232323,4444,22,2220929,22323,13)
z=(data-mean(data))/sd(data)
result=dnorm(z,0,1)
sum(result)
[1] 1.879131
As above, the sum of 'dnorm' is neither 1 nor 0.
Then let's say I use a zero mean and a standard deviation of one directly in my z transformation:
data=c(232323,4444,22,2220929,22323,13)
z=(data-0)/1
result=dnorm(z,0,1)
sum(result)
[1] 7.998828e-38
I still do not get either 0 or 1 as the sum.
If my purpose is to get probabilities that sum to one, as I will need for my further usage, what method do you recommend, whether using 'dnorm' or other PDF functions?

dnorm returns the normal probability density function evaluated at its argument. It does not return probabilities. Why should the sum of your transformed data evaluated in the density function equal one or zero? You're working with a random sample; there is no reason the sum should ever equal exactly zero or one.
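To see that dnorm() returns density values rather than probabilities, note that a density can exceed 1 (a quick check, not part of the original answer):
dnorm(0, mean = 0, sd = 1)    # 0.3989423, the peak of the standard normal density
dnorm(0, mean = 0, sd = 0.1)  # 3.989423 -- clearly not a probability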
Integrating dnorm yields a probability. Integrating dnorm over the entire support of the random variable yields a probability of one:
integrate(dnorm, -Inf, Inf)
#1 with absolute error < 9.4e-05
In fact, integrate(dnorm, -Inf, x) conceptually equals pnorm(x) for all x.
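A quick check at an arbitrary point, say x = 1.5:
integrate(dnorm, -Inf, 1.5)$value  # approximately 0.9331928
pnorm(1.5)                         # [1] 0.9331928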
Edit: In light of your comment.
The same applies to other continuous probability distributions (PDFs):
integrate(dexp, 0, Inf, rate = 57)
#1 with absolute error < 1.3e-05
Note that the ... argument(s) from ?integrate are passed on to the integrand.
Recall also that the Poisson distribution, say, is a discrete probability distribution, and hence integrating it (in the conventional sense) makes no sense. A discrete probability distribution has a probability mass function (PMF), not a PDF, and the PMF actually returns probabilities. In that case, the probabilities should sum to one.
Consider:
dpois(0.5, lambda = 2)
#[1] 0
#Warning message:
#In dpois(0.5, lambda = 2) : non-integer x = 0.500000
Summing from 0 to a 'very' large number (i.e. over the support of the Poisson distribution):
sum(dpois(0:1000000, lambda = 2))
#[1] 1

Related

optim() when optimizing for one parameter in R

I'm trying to find the single value of scale in function bayesmeta::qhalfnormal such that the first and the second elements of the vector low_high <- c(.1, 1) have .025 and .975 probability of happening in it, respectively.
In other words, for what single value of scale does .1 have probability .025 and 1 have probability .975?
So, I have one parameter (scale) to optimize, and I expect a single value for it. I'm using optim below, but this way I get two values for scale.
Is there a better optimization function to give me a single value for scale?
library(bayesmeta)
low_high <- c(.1, 1)
alpha <- c(.025, .975)
f <- function(x) {
  low_high - qhalfnormal(alpha, scale = x)
}
optim(low_high, function(x) sum(f(x)^2))
# $par
# [1] 3.1939758 0.4461607 # I expect a single value for `scale`
# But it seems `optim()` has acted like `Vectorize(optimize)` looping over
# elements of `low_high` vector.
@anonymous.asker is correct that passing a vector of length 2 is confusing optim(). What is happening is that qhalfnormal() is vectorizing both over the quantile levels and over the vector of scale values you gave it: e.g. qhalfnormal(c(0.025, 0.975), scale = c(0.1, 1)) returns a two-element vector comprising (1) the 0.025 quantile for a scale parameter of 0.1 and (2) the 0.975 quantile for a scale parameter of 1. These then get collapsed to a single output value by the sum-of-squares operation ... (what you wanted, I think, was to evaluate qhalfnormal() at both quantile levels for a single scale parameter).
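A small demonstration of the difference (this is a sketch, not part of the original answer; the single scale value 0.45 is just an illustrative guess near the solution):
qhalfnormal(c(0.025, 0.975), scale = c(0.1, 1))  # recycled element-wise: one quantile per (level, scale) pair
qhalfnormal(c(0.025, 0.975), scale = 0.45)       # what was intended: both quantile levels for one scale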
If you specify a single value that is close enough to the true value you get an answer, and a warning suggesting that you not use Nelder-Mead:
optim(0.45, function(x)sum(f(x)^2))
If your starting value is too far from the solution you get a warning and an error as soon as the algorithm tries a negative value for the parameter ("scale > 0 is not TRUE").
The sensible way to do this is to specify method="Brent" (as suggested by the warning), at which point you also need to specify bounds:
optim(1, function(x)sum(f(x)^2), method="Brent", lower=0, upper=10)
This returns 0.4461; this is indeed the argmin (the parameter corresponding to the minimum value) for this problem. As @Onyambu points out in the comments, though, it doesn't really solve the larger problem (which is to try to reduce both values to 0); it solves the problem as posed, which is to minimize the objective function ...
You are passing a vector of length 2 as the initial estimate. If you want to set bounds for your variable, those are set via different arguments in optim.

Defining exponential distribution in R to estimate probabilities

I have a bunch of random variables (X1,....,Xn) which are i.i.d. Exp(1/2) and represent the duration of a certain event. So this distribution obviously has an expected value of 2, but I am having problems defining it in R. I did some research and found something about so-called Monte Carlo simulation, but I don't seem to find what I am looking for in it.
An example of what I want to estimate: let's say we have 10 random variables (X1,..,X10) distributed as above, and we want to determine, for example, the probability P([X1+...+X10<=25]).
Thanks.
You don't actually need Monte Carlo simulation in this case because:
If Xi ~ Exp(λ) then the sum (X1 + ... + Xk) ~ Erlang(k, λ), which is just a Gamma(k, 1/λ) (in the (k, θ) parametrization) or Gamma(k, λ) (in the (α, β) parametrization) with an integer shape parameter k.
From wikipedia (https://en.wikipedia.org/wiki/Exponential_distribution#Related_distributions)
So, P([X1+...+X10<=25]) can be computed by
pgamma(25, shape=10, rate=0.5)
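As a sanity check (not part of the original answer), the closed-form value can be compared against a quick Monte Carlo estimate:
set.seed(1)
exact <- pgamma(25, shape = 10, rate = 0.5)
mc <- mean(replicate(10000, sum(rexp(10, rate = 0.5)) <= 25))
c(exact = exact, monte_carlo = mc)  # the two should agree to roughly two decimal places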
Are you aware of the rexp() function in R? Have a look at its documentation page by typing ?rexp in the R console.
A quick answer to your Monte Carlo estimation of desired probability:
mean(rowSums(matrix(rexp(1000 * 10, rate = 0.5), 1000, 10)) <= 25)
I have generated 1000 sets of 10 exponential samples, putting them into a 1000 * 10 matrix. We take row sums and get a vector of 1000 entries. The proportion of values between 0 and 25 is an empirical estimate of the desired probability.
Thanks, this was helpful! Can I use replicate with this code, to make it look like this: F <- function(n, B=1000) mean(replicate(B, (rexp(10, rate = 0.5))))? I am unable to get the right result, though.
replicate here generates a matrix, too, but it is a 10 * 1000 matrix (as opposed to the 1000 * 10 one in my answer), so you now need to take colSums. Also, where did you put n?
The correct function would be
F <- function(n, B=1000) mean(colSums(replicate(B, rexp(10, rate = 0.5))) <= n)
For a non-Monte Carlo method for your given example, see the other answer. The exponential distribution is a special case of the gamma distribution, and the latter has an additivity property.
I am giving you the Monte Carlo method because you name it in your question, and it is applicable beyond your example.

Approximation of results

I'm just learning how to use R. I'm practicing some statistics, such as the normal distribution, the Poisson distribution, etc.
When I try to calculate probabilities and the answer is a number very close to zero (0), the program shows 0 as the result, so I can't see the full answer, and I need the full answer. There is always a probability, even a small one!!
My question is: can I turn off this rounding, or what code can I use to get the full answer?
Example:
1-pbinom(q =10, size = 10,prob = 0.8)
Result:
0
The pbinom function gives the cumulative distribution function, that is, the probability that a value is less than or equal to a particular value. So with a discrete distribution like the binomial distribution with 10 draws,
pbinom(10, 10, .8)
# [1] 1
tells you that there is a 100% chance you will observe 10 or fewer successes.
Perhaps you're thinking of the probability density function (or probability mass function since this is a discrete distribution) dbinom
dbinom(10, 10, .8)
# [1] 0.1073742
means that there is a roughly 11% chance that all your draws will be successes. It's also true that
sum(dbinom(0:10, 10, .8))
# [1] 1
that the sum of the probabilities of getting 0 through 10 successes is exactly 1.
So with these cases you are getting the exact answer. R does round values in the console according to the options(digits=) setting, but that's not what's happening here.
pbinom is the distribution function for the binomial distribution, which is discrete, so the result can be exactly 1 (as in your example). You might have been thinking of continuous distributions like the normal or gamma distributions. In that case, rounding in double precision can truncate a tiny tail probability to 0, for example
> 1 - pnorm(10, 0, 1)
[1] 0
However, the p[dist] functions have an argument lower.tail=FALSE designed to address this problem:
> pnorm(10, 0, 1, lower.tail=FALSE)
[1] 7.619853e-24
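For tails so extreme that even lower.tail=FALSE underflows to zero, the p[dist] functions also accept log.p=TRUE, which reports the logarithm of the probability instead (a brief sketch, not part of the original answer):
pnorm(40, lower.tail = FALSE)                # 0 -- the probability underflows double precision
pnorm(40, lower.tail = FALSE, log.p = TRUE)  # about -804.6, i.e. a probability of roughly exp(-804.6)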

Converting Optim to constrOptim in R

I am trying to determine the weights of 9 metrics which will return the highest accuracy ratio. Since they are weights, the values need to sum to 1 and lie between 0 and 1. I am currently using the optim function, but due to the constraints, I think I need to switch to constrOptim. I was wondering the best way to do this. Below I have included the code I am currently using. x.matrix is a 20,000 by 9 matrix of values ranked between 1 and 10.
pars<-c(w1=(1/9),w2=(1/9),w3=(1/9),w4=(1/9),w5=(1/9),w6=(1/9),w7=(1/9),w8=(1/9),w9=(1/9))
OptPars <- function(pars){ -rcorr.cens(x.matrix %*% pars, f)["Dxy"] }  # rcorr.cens() is from the Hmisc package
opt<-optim(pars,OptPars)
Say you have values x on the range (-Inf, Inf) and you need values p in the range [0, 1] that sum to 1; you can do the following transformation:
p <- exp(x)/sum(exp(x))
If you do that transformation in your optimization function and apply the same transformation to the best set of parameters, you should get what you want.
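A minimal sketch of how this could be wired into the code from the question (assuming x.matrix, f, and OptPars are defined as above; the helper name softmax is mine):
softmax <- function(x) exp(x) / sum(exp(x))  # maps any real vector to weights in (0, 1) that sum to 1

# optimize over unconstrained values, transforming to weights inside the objective
opt <- optim(rep(0, 9), function(x) OptPars(softmax(x)))
weights <- softmax(opt$par)  # best weights: each between 0 and 1, summing to 1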

Combining two normal random variables

Suppose I have the following 2 random variables:
X where mean = 6 and stdev = 3.5
Y where mean = -42 and stdev = 5
I would like to create a new random variable Z based on the first two and knowing that : X happens 90% of the time and Y happens 10% of the time.
It is easy to calculate the mean for Z: 0.9 * 6 + 0.1 * (-42) = 1.2
But is it possible to generate random values for Z in a single function?
Of course, I could do something along those lines :
if (randIntBetween(1,10) > 1)
    GenerateRandomNormalValue(6, 3.5);
else
    GenerateRandomNormalValue(-42, 5);
But I would really like to have a single function that would act as a probability density function for such a random variable (Z) that is not necessarily normal.
sorry for the crappy pseudo-code
Thanks for your help!
Edit : here would be one concrete interrogation :
Let's say we add up 5 consecutive values from Z. What would be the probability of ending up with a number higher than 10?
But I would really like to have a single function that would act as a probability density function for such a random variable (Z) that is not necessarily normal.
Okay, if you want the density, here it is:
rho = 0.9 * density_of_x + 0.1 * density_of_y
But you cannot sample from this density unless you 1) compute its CDF (cumbersome, but not infeasible) and 2) invert it (you will need a numerical solver for this). Or you can do rejection sampling (or variants, e.g. importance sampling). This is costly, and cumbersome to get right.
So you should go for the "if" statement (i.e. call the generator 3 times), except if you have a very strong reason not to (using quasi-random sequences, for instance).
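For reference, a rough R sketch of both pieces, using the numbers from the question (the function names dz and rz are mine):
dz <- function(z) 0.9 * dnorm(z, mean = 6, sd = 3.5) + 0.1 * dnorm(z, mean = -42, sd = 5)  # mixture density of Z

rz <- function(n) {  # sample by picking a component for each draw, then drawing from it
  pick <- runif(n) < 0.9
  ifelse(pick, rnorm(n, 6, 3.5), rnorm(n, -42, 5))
}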
If a random variable is denoted x=(mean,stdev) then the following algebra applies
number * x = ( number*mean, number*stdev )
x1 + x2 = ( mean1+mean2, sqrt(stdev1^2+stdev2^2) )
so for the case of X = (mx,sx), Y= (my,sy) the linear combination is
Z = w1*X + w2*Y = (w1*mx,w1*sx) + (w2*my,w2*sy) =
( w1*mx+w2*my, sqrt( (w1*sx)^2+(w2*sy)^2 ) ) =
( 1.2, 3.19 )
Link: Normal Distribution; look for the Miscellaneous section, item 1.
PS. Sorry for the weird notation. The new standard deviation is calculated by something similar to the Pythagorean theorem: it is the square root of the sum of squares.
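For what it's worth, the arithmetic above can be checked directly in R (values taken from the question):
w <- c(0.9, 0.1); m <- c(6, -42); s <- c(3.5, 5)
sum(w * m)            # 1.2
sqrt(sum((w * s)^2))  # approximately 3.19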
This is the form of the distribution:
ListPlot[BinCounts[Table[If[RandomReal[] < .9,
RandomReal[NormalDistribution[6, 3.5]],
RandomReal[NormalDistribution[-42, 5]]], {1000000}], {-60, 20, .1}],
PlotRange -> Full, DataRange -> {-60, 20}]
It is NOT Normal, as you are not adding Normal variables, but just choosing one or the other with certain probability.
Edit
This is the curve for adding five vars with this distribution:
The upper and lower peaks represent taking one of the distributions alone, and the middle peak accounts for the mixing.
The most straightforward and generically applicable solution is to simulate the problem:
Run the piecewise function you have 1,000,000 times (just a high number), generate a histogram of the results by splitting them into bins, and divide the count for each bin by your N (1,000,000 in my example). This will leave you with an approximation of the PDF of Z at each bin.
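A rough sketch of that simulation in R, reusing the rz() helper sketched earlier and also answering the follow-up about summing five values (the bin count and sample size are arbitrary choices):
set.seed(42)
n <- 1e6
z <- rz(n)
h <- hist(z, breaks = 200, plot = FALSE)  # h$density approximates the PDF of Z on each bin

# Monte Carlo estimate of P(Z1 + ... + Z5 > 10)
sums5 <- colSums(matrix(rz(5 * n), nrow = 5))
mean(sums5 > 10)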
Lots of unknowns here, but essentially you just wish to add the two (or more) probability functions to one another.
For any given probability function you could calculate a random number with that density by calculating the area under the probability curve (the integral) and then generating a random number between 0 and that area. Then move along the curve until the area is equal to your random number and use that as your value.
This process can then be generalized to any function (or sum of two or more functions).
Elaboration:
Suppose you have a density function f(x) defined on the range 0 to 1. You could generate a random number from the distribution by calculating the integral of f(x) from 0 to 1, giving you the area under the curve; let's call it A.
Now, you generate a random number between 0 and A; let's call that number r. Now you need to find a value t such that the integral of f(x) from 0 to t is equal to r. That t is your random number.
This process can be used for any probability density function f(x). Including the sum of two (or more) probability density functions.
I'm not sure what your functions look like, so I'm not sure whether you can obtain analytic solutions for all this, but worst case scenario, you could use numerical techniques to approximate the result.
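A hedged sketch of that inversion idea using R's built-in numerical tools (integrate and uniroot); here f can be any density on a finite interval [lo, hi], for example the dz() mixture from earlier restricted to a wide range:
sample_by_inversion <- function(f, lo, hi) {
  A <- integrate(f, lo, hi)$value  # total area under the curve
  r <- runif(1, 0, A)              # random area target between 0 and A
  # find t such that the area from lo to t equals r
  uniroot(function(t) integrate(f, lo, t)$value - r, c(lo, hi))$root
}

# example: draw one value from the mixture density dz() over a wide interval
sample_by_inversion(dz, -80, 40)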

Resources