dmultinom function for the Multinomial distribution in R

The function dmultinom(x, size = NULL, prob, log = FALSE) computes probabilities for a Multinomial distribution. However, it does not run with size = 1.
Theoretically, setting size = 1 should make the Multinomial distribution equivalent to the Categorical distribution.
Does anybody know why it throws an error?
FYI, the Categorical distribution can be modelled by dist.Categorical {LaplacesDemon}.
Examples:
dmultinom(c(1,2,1),size = 1,prob = c(0.3,0.5,0.4))
Error in dmultinom(c(1, 2, 1), size = 1, prob = c(0.3, 0.5, 0.4)) :
size != sum(x)
dcat(c(1,2,1),p = c(0.3,0.5,0.4))
[1] 0.3 0.5 0.3
Thanks

LaplacesDemon::dcat and stats::dmultinom do two different things. If you have multiple observations, dcat takes a vector of category values, whereas dmultinom takes a single vector of counts, so you have to construct a matrix of responses and use apply (or something similar).
library(LaplacesDemon)
probs <- c(0.3,0.5,0.2)
dcat(c(1,2,1), p = probs) ## ans: 0.3 0.5 0.3
x <- matrix(c(1, 0, 0,
              0, 1, 0,
              1, 0, 0),
            nrow = 3, byrow = TRUE)
apply(x, 1, dmultinom, size = 1, prob = probs)
(I modified your example because your original probabilities, c(0.3,0.5,0.4), don't add up to 1. Neither function gives you a warning, but dmultinom automatically rescales the probabilities to sum to 1.)
If I try dmultinom(c(1,2,1), p = probs, size = 1) I get
size != sum(x)
that is, dmultinom interprets c(1,2,1) as "one sample from group 1, two samples from group 2, one from group 3", which isn't consistent with a total sample size of 1 ...
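If you have dcat-style category indices and want to feed them to dmultinom, one option is to one-hot encode them first. A minimal sketch (the cats and onehot names are mine, not from either package):
library(LaplacesDemon)
probs <- c(0.3, 0.5, 0.2)
cats  <- c(1, 2, 1)                   # category indices, dcat-style
onehot <- diag(length(probs))[cats, ] # row i is the count vector for observation i
apply(onehot, 1, dmultinom, size = 1, prob = probs) # 0.3 0.5 0.3
dcat(cats, p = probs)                               # same: 0.3 0.5 0.3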

Dealing with floating point errors in sums of probabilities

We know that the prob argument in sample is used to assign a vector of probability weights.
For example,
table(sample(1:4, 1e6, replace = TRUE, prob = c(0.2, 0.4, 0.3, 0.1)))/1e6
# 1 2 3 4
#0.2 0.4 0.3 0.1
table(sample(1:4, 1e6, replace = TRUE, prob = c(0.2, 0.4, 0.3, 0.1)))/1e6
# 1 2 3 4
#0.200 0.400 0.299 0.100
In this example the probabilities sum to exactly 1 (0.2 + 0.4 + 0.3 + 0.1), hence it gives the expected proportions. But what if the probabilities do not sum to 1? What output would that give? I thought it would result in an error, but it returns values anyway.
When the probabilities sum to more than 1:
table(sample(1:4, 1e6, replace = TRUE, prob = c(0.2, 0.5, 0.5, 0.1)))/1e6
# 1 2 3 4
#0.1544 0.3839 0.3848 0.0768
table(sample(1:4, 1e6, replace = TRUE, prob = c(0.2, 0.5, 0.5, 0.1)))/1e6
# 1 2 3 4
#0.1544 0.3842 0.3848 0.0767
When the probabilities sum to less than 1:
table(sample(1:4, 1e6, replace = TRUE, prob = c(0.1, 0.1, 0.5, 0.1)))/1e6
# 1 2 3 4
#0.124 0.125 0.625 0.125
table(sample(1:4, 1e6, replace = TRUE, prob = c(0.1, 0.1, 0.5, 0.1)))/1e6
# 1 2 3 4
#0.125 0.125 0.625 0.125
As we can see, running it multiple times gives output that does not equal prob, yet the results are not random either. How are the numbers distributed in this case? Where is this documented?
I tried searching on the internet but didn't find any relevant information. I looked through the documentation at ?sample, which says:
The optional prob argument can be used to give a vector of weights for obtaining the elements of the vector being sampled. They need not sum to one, but they should be non-negative and not all zero. If replace is true, Walker's alias method (Ripley, 1987) is used when there are more than 200 reasonably probable values: this gives results incompatible with those from R < 2.2.0.
So it says that the prob argument need not sum to 1, but it doesn't say what to expect when it doesn't sum to 1. Am I missing some part of the documentation? Does anybody have any idea?
Good question. The docs are unclear on this, but the question can be answered by reviewing the source code.
If you look at the R code, sample always calls another R function, sample.int. If you pass in a single number x to sample, it will use sample.int to create a vector of integers less than or equal to that number, whereas if x is a vector, it uses sample.int to generate a sample of integers less than or equal to length(x), then uses that to subset x.
Now, if you examine the function sample.int, it looks like this:
function (n, size = n, replace = FALSE, prob = NULL,
          useHash = (!replace && is.null(prob) && size <= n/2 && n > 1e+07))
{
    if (useHash)
        .Internal(sample2(n, size))
    else .Internal(sample(n, size, replace, prob))
}
The .Internal means any sampling is done by calling compiled code written in C: in this case, it's the function do_sample, defined here in src/main/random.c.
If you look at this C code, do_sample checks whether it has been passed a prob vector. If not, it samples on the assumption of equal weights. If prob exists, the function ensures that it is numeric and not NA. If prob passes these checks, a pointer to the underlying array of doubles is generated and passed to another function in random.c called FixUpProbs, defined here.
This function examines each member of prob and throws an error if any element is negative or not a finite double. It then normalises the weights by dividing each by their sum. There is therefore no requirement at all in the code for prob to sum to 1. That is, even if prob sums to 1 in your input, the function will still compute the sum and divide each number by it.
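A quick way to see this normalisation in action (my own check, not part of the original answer): with the same seed, weight vectors that differ only by a constant factor produce identical samples.
set.seed(42)
a <- sample(1:4, 10, replace = TRUE, prob = c(1, 2, 3, 4))
set.seed(42)
b <- sample(1:4, 10, replace = TRUE, prob = c(10, 20, 30, 40))
identical(a, b) # TRUE: both normalise to c(0.1, 0.2, 0.3, 0.4)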
Therefore, the parameter is poorly named. It should be "weights", as others here have pointed out. To be fair, the docs only say that prob should be a vector of weights, not absolute probabilities.
So, from my reading of the code, the behaviour of the prob parameter should be:
prob can be absent altogether, in which case sampling defaults to equal weights.
If any of prob's values are negative, infinite, or NA, the function will throw an error.
An error is also thrown if any of the prob values are non-numeric, as they will be interpreted as NA in the SEXP passed to the C code.
prob must have the same length as x, or the C code throws an error.
You can pass a zero probability for one or more elements of prob if you have specified replace = TRUE, as long as at least one weight is non-zero.
If you specify replace = FALSE, the number of samples you request must be less than or equal to the number of non-zero elements of prob. Essentially, FixUpProbs will throw an error if you ask it to sample with a zero probability.
A valid prob vector will be normalised to sum to 1 and used as sampling weights.
As an interesting side effect of this behaviour, you can use odds instead of probabilities when choosing between two alternatives, by setting prob = c(1, odds).
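For instance (my own illustration), odds of 3:1 in favour of the second alternative:
set.seed(1)
draws <- sample(c("A", "B"), 1e6, replace = TRUE, prob = c(1, 3))
table(draws) / 1e6 # roughly 0.25 and 0.75 after normalisation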
As already mentioned, the weights are normalised to sum to 1, which can be demonstrated with the weights from your first example:
x <- c(0.2, 0.5, 0.5, 0.1)
x/sum(x)
[1] 0.15384615 0.38461538 0.38461538 0.07692308
This matches your simulated tabulated data:
# 1 2 3 4
#0.1544 0.3839 0.3848 0.0768

How would you count the number of elements that are TRUE in a vector?

The PDF is f_R(r) = 1/(1+r)^2, and Rsample = Xsample/Ysample, where X and Y are independent exponential distributions with rate = 0.001; Xsample is 100 values stored in x, and Ysample is 100 values stored in y.
Find the CDF F_R(r) corresponding to the PDF and evaluate it at r ∈ {0.1, 0.2, 0.25, 0.5, 1, 2, 4, 5, 10}. Find the proportions of values in Rsample less than each of these values of r, and plot the proportions against F_R(0.1), F_R(0.2), ..., F_R(5), F_R(10). What does this plot show?
I know that the CDF is the integral of the PDF, but wouldn't this give me negative values? Also, for the proportions section, how would you count the number of elements that are TRUE, that is, the number of elements for which Rsample is less than each element of r?
r=c(0.1,0.2,0.2,0.5,1,2,4,5,10)
prop=c(1:9)
for(i in 1:9)
{
  x=Rsample<r[i]
  prop[i]=c(TRUE,FALSE)
}
sum(prop[i])
You've made a few different errors here. The solution should look something like this.
Start by defining your variables and drawing your samples from the exponential distribution using rexp(100, 0.001):
r <- c(0.1, 0.2, 0.25, 0.5, 1, 2, 4, 5, 10)
set.seed(69) # Make random sample reproducible
x <- rexp(100, 0.001) # 100 random samples from exponential distribution
y <- rexp(100, 0.001) # 100 random samples from exponential distribution
Rsample <- x/y
The tricky part is getting the proportion of Rsample that is less than each value of r. For this we can use sapply instead of a loop.
props <- sapply(r, function(x) length(which(Rsample < x))/length(Rsample))
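Equivalently, since logical values coerce to 0/1 under arithmetic, mean gives the proportion directly and returns the same result:
props <- sapply(r, function(x) mean(Rsample < x))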
We get the CDF from the PDF by integrating; the constant of integration is what keeps it non-negative: F_R(r) = 1 - 1/(1 + r), so F_R(0) = 0 and F_R(r) -> 1 as r grows. (The bare antiderivative -1/(1 + r) is the source of the negative values you were worried about.)
cdf_at_r <- 1 - 1/(1 + r) # CDF of the PDF 1/(1+r)^2 at the above values of r
And we can see what happens when we plot the proportions against the CDF:
plot(cdf_at_r, props)
# The points lie close to the line y = x:
lines(c(0, 1), c(0, 1), lty = 2, col = "red")
This is how you can count the number of elements for which Rsample is less than each element of r:
r <- c(0.1, 0.2, 0.25, 0.5, 1, 2, 4, 5, 10)
counts <- integer(length(r))
for (i in seq_along(r)) {
  counts[i] <- sum(Rsample < r[i])  # sum() counts the TRUE values
}
counts
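The underlying idiom, and the direct answer to the title question: a logical vector coerces to 0s and 1s, so sum counts the TRUE elements and mean gives their proportion.
v <- c(TRUE, FALSE, TRUE, TRUE)
sum(v)  # 3: the number of TRUE elements
mean(v) # 0.75: the proportion of TRUE elements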

R: what is the vector of quantiles in the density function dmvnorm

library(mvtnorm)
dmvnorm(x, mean = rep(0, p), sigma = diag(p), log = FALSE)
dmvnorm provides the density function for a multivariate normal distribution. What exactly does the first parameter, x, represent? The documentation says "vector or matrix of quantiles. If x is a matrix, each row is taken to be a quantile."
> dmvnorm(x=c(0,0), mean=c(1,1))
[1] 0.0585
Here is the sample code on the help page. In that case, are you generating the probability of having quantile 0 under a normal distribution with mean 1 and sd 1 (assuming those are the defaults)? Since this is a multivariate normal density function and a vector of quantiles (0, 0) was passed in, why isn't the output a vector of probabilities?
Just taking a bivariate normal (X1, X2) as an example: by passing in x = (0, 0), you get the joint density evaluated at the single point (X1 = 0, X2 = 0), which is one value. Why do you expect a vector?
If you want a vector, you need to pass in a matrix. For example, x = cbind(c(0,1), c(0,1)) gives the densities at
(X1 = 0, X2 = 0)
(X1 = 1, X2 = 1)
In this situation, each row of the matrix is treated as a separate quantile, and you get one density value per row.
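A quick check of both call forms (my own example; output rounded), using mean = c(1, 1) and the default identity covariance:
library(mvtnorm)
dmvnorm(x = c(0, 0), mean = c(1, 1))
# [1] 0.0585 -- a single density value
dmvnorm(x = cbind(c(0, 1), c(0, 1)), mean = c(1, 1))
# [1] 0.0585 0.1592 -- one density value per row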

How should I specify argument "prob" when using sample() for resampling?

In short
I'm trying to better understand the argument prob as part of the function sample in R. In what follows, I both ask a question, and provide a piece of R code in connection with my question.
Question
Suppose I have generated 10,000 random standard normal values with rnorm. I then want to draw a sample of size 5 from this mother sample of 10,000 values.
How should I set the prob argument within sample so that the probability of drawing these 5 numbers from the mother sample reflects that the middle of the distribution is denser while the tails are thinner (so that the draws come from the denser middle more often than from the tails)?
x = rnorm(1e4)
sample( x = x, size = 5, replace = TRUE, prob = ? ) ## what should be "prob" here?
# OR I leave `prob` to be the default by not using it:
sample( x = x, size = 5, replace = TRUE )
Overthinking is the devil.
You want to resample these samples following the original distribution, that is, the empirical distribution. Think about how an empirical CDF is obtained:
plot(sort(x), 1:length(x)/length(x))
In other words, the empirical PDF is just
plot(sort(x), rep(1/length(x), length(x)))
So we want prob = rep(1/length(x), length(x)), or simply prob = rep(1, length(x)), since sample normalises prob internally. Or just leave it unspecified, as equal probability is the default.
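As a quick sanity check (my own illustration), resampling with the default equal weights already reproduces the shape of the mother sample, dense middle and thin tails included:
set.seed(123)
x <- rnorm(1e4)                       # mother sample
res <- sample(x, 1e4, replace = TRUE) # equal weights by default
quantile(x,   c(0.025, 0.5, 0.975))
quantile(res, c(0.025, 0.5, 0.975))   # tracks the mother sample closely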

Z-scores rounded to infinity for small p-values in R

I am working with a genome-wide association study dataset, with p-values ranging from 1E-30 to 1. I have an R data frame "data" which includes a variable "p" for the p-values.
I need to perform genomic correction of the p-values, which I am doing using the following code:
p=data$p
Zsq = qchisq(1-p, 1)
lambda = median(Zsq)/0.456
newZsq = Zsq/lambda
Newp = 1-pchisq(newZsq, 1)
In the command on the second line, where I use the qchisq function to convert p-values to z-scores, z-scores for p-values < 1E-16 are returned as infinity, because 1 - p rounds to exactly 1 in double precision. This means the p-values for my most significant data points are rounded to 0 after the genomic correction, and I lose their ranking.
Is there any way around this?
Read help(".Machine"). Then set lower.tail=FALSE and avoid taking differences with 1:
p <- 1e-17
Zsq = qchisq(p, 1, lower.tail=FALSE)
lambda = median(Zsq)/0.456
newZsq = Zsq/lambda
Newp = pchisq(newZsq, 1, lower.tail=FALSE)
#[1] 0.4994993
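To see why the original 1 - p version overflows, here is a quick illustration of the double-precision limit:
.Machine$double.eps                  # 2.220446e-16, the machine epsilon
1 - 1e-17 == 1                       # TRUE: the tiny p-value is lost entirely
qchisq(1 - 1e-17, 1)                 # Inf
qchisq(1e-17, 1, lower.tail = FALSE) # large but finite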
