I'm looking for a function which does the same thing as Excel's CHIINV.
From Microsoft documentation, the definition of CHIINV is
Returns the inverse of the right-tailed probability of the chi-squared distribution
For example
=CHIINV(0.2,2) returns 3.21
The closest function I could find in R is
geoR's dinvchisq
However,
dinvchisq(0.2,2) returns 1.026062
Please help!
What you want is ?qchisq. This takes a probability and degrees of freedom, and outputs the associated quantile. Consider:
> qchisq(p=0.2, df=2, lower.tail=FALSE)
[1] 3.218876
Furthermore, according to the documentation, dinvchisq() is the density function (the height of the pdf at a given quantile) of the inverse chi-squared distribution, i.e. the distribution of 1/X when X is chi-squared. You need the quantile function, not the density function, and you don't want the inverse chi-squared distribution either (although the confusion seems natural coming from Excel's function).
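As a quick sanity check (my own sketch, not from the original answer), the upper-tail quantile that CHIINV computes can be requested either with lower.tail=FALSE or by flipping the probability:

qchisq(0.2, df = 2, lower.tail = FALSE)  # 3.218876, matches CHIINV(0.2, 2)
qchisq(1 - 0.2, df = 2)                  # same value via the lower tail
qchisq(0.2, df = 2)                      # ~0.446, the lower-tail quantile -- NOT what CHIINV returns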
Related
This may be a dumb question, but I don't understand why sd(dnorm(1:100, mean=50, sd=15)) doesn't return the standard deviation I specified, [1] 15.0, instead of what it actually returns, which is [1] 0.009440673. When I do the same with rnorm(), sd(rnorm(100, mean=50, sd=15)) returns what I would expect, a number close to 15: [1] 17.00682. Can someone please explain why sd(dnorm(x, mean=mean, sd=sd)) doesn't return the standard deviation that I passed to dnorm?
The dnorm function returns the value of the normal density (with the mean of 50 and standard deviation of 15 you gave it) evaluated at each of the points 1:100; it does not draw numbers from that distribution.
rnorm, on the other hand, samples 100 numbers from that normal distribution, which is why its standard deviation comes out close to 15.
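To make the distinction concrete, here is a small sketch (my own illustration, not part of the original answer):

d <- dnorm(1:100, mean = 50, sd = 15)  # 100 density values; the peak is dnorm(50, 50, 15) ~ 0.0266
range(d)                               # roughly 0.0001 to 0.027 -- all values small and close together
sd(d)                                  # ~0.009, the spread of the density values themselves
r <- rnorm(100, mean = 50, sd = 15)    # 100 random draws from N(50, 15)
sd(r)                                  # close to 15, the sd of the distribution being sampled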
It's always helpful to plot your data. If you try hist(dnorm(1:100, mean=50, sd=15)) you'll see that the variability is very small. As MkWTF points out, that's because dnorm returns the value of the probability density function of the normal distribution at value x, given the specified mean and sd.
rnorm in contrast generates random numbers with probability given by the probability density function of the normal distribution, which is why it allows a sensible estimate of the SD - the generated values follow that distribution.
The documentation for dnorm/pnorm/qnorm/rnorm is not great in my opinion (as someone who lacks a background in mathematics), but if you spend some time reading different online resources about these functions, and make sure you understand the underlying concepts (probability density functions, quantiles, random number generation, and (cumulative) distribution functions), it will become clear over time.
hist(dnorm(1:100, mean=50, sd=15))
I am trying to calculate hypergeometric probabilities using phyper in R, and I noticed a strange behavior. I am looking at gene-set overlap probabilities, and in one case there are no "successful draws", so:
x=0
m=430
n=19500
k=2
Since in general I'm looking for over-enrichment, I use:
phyper(0,600,19000,2,lower.tail=FALSE)
and get 0.043, which appears to be significant.
However, choosing two "n" genes should have probability of
19500/19930*19499/19929=0.957
so, shouldn't the phyper result be greater than 0.957?
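For reference, a minimal sketch of the two quantities being compared, using the m = 430, n = 19500, k = 2 values listed above (my own check, assuming those are the intended parameters):

dhyper(0, m = 430, n = 19500, k = 2)                      # P(X = 0) = 19500/19930 * 19499/19929 ~ 0.957
phyper(0, m = 430, n = 19500, k = 2, lower.tail = FALSE)  # P(X > 0) = 1 - P(X = 0) ~ 0.043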
With the code below I calculate the density of a bivariate normal distribution, using two formulas which should return the same result.
The first formula uses the dmvnorm of the mvtnorm package and the second formula uses the formula from Wikipedia (https://en.wikipedia.org/wiki/Multivariate_normal_distribution).
When the standard deviation of both marginals equals one (the covariance matrix has only ones on the main diagonal), the results are the same. But when I change the two diagonal entries of the covariance matrix to two, or one third, etc., the two results are no longer identical.
(I hope) I have read the help properly and also this document (https://cran.r-project.org/web/packages/mvtnorm/vignettes/MVT_Rnews.pdf).
Here on Stack Overflow (How to calculate multivariate normal distribution function in R) I found this, because perhaps my covariance matrix is wrongly defined.
But until now I couldn’t find an answer…
So my question: why does my code return different results when the standard deviation does not equal one?
I hope I gave enough information... but if something is missing, please comment and I will edit my question.
Many thanks in advance!
And now my code:
library(mvtnorm) # for loading the package if necessary
mu=c(0,0)
rho=0
sigma=c(1,1) # the standard deviation which should be changed to two or one third or… to see the different results
S=matrix(c(sigma[1],0,0,sigma[2]),ncol=2,byrow=TRUE)
x=rmvnorm(n=100,mean=mu,sigma=S)
dim(x) # for control
x[1:5,] # for visualization
# defining a function
Comparison=function(Points=x,mean=mu,sigma=S,quantity=4) {
for (i in 1:quantity) {
print(paste0("The ",i," random point"))
print(Points[i,])
print("The following two results should be the same")
print("Result from the function 'dmvnorm' out of package 'mvtnorm'")
print(dmvnorm(Points[i,],mean=mu,sigma=sigma,log=FALSE))
print("Result from equation out of wikipedia")
print(1/(2*pi*S[1,1]*S[2,2]*(1-rho^2)^(1/2))*exp((-1)/(2*(1-rho^2))*(Points[i,1]^2/S[1,1]^2+Points[i,2]^2/S[2,2]^2-(2*rho*Points[i,1]*Points[i,2])/(S[1,1]*S[2,2]))))
print("----")
print("----")
} # end for-loop
} # end function
# execute the function and compare the results
Comparison(Points=x,mean=mu,sigma=S,quantity=4)
Remember that S is the variance-covariance matrix. The formula you use from Wikipedia uses the standard deviation, not the variance. Hence you need to plug the square root of the diagonal entries into the formula. This is also the reason why it works when you choose 1 as the diagonal entries (both the variance and the SD are 1).
See your modified code below:
library(mvtnorm) # for loading the package if necessary
mu=c(0,0)
rho=0
sigma=c(2,1) # the standard deviation which should be changed to two or one third or… to see the different results
S=matrix(c(sigma[1],0,0,sigma[2]),ncol=2,byrow=TRUE)
x=rmvnorm(n=100,mean=mu,sigma=S)
dim(x) # for control
x[1:5,] # for visualization
# defining a function
Comparison=function(Points=x,mean=mu,sigma=S,quantity=4) {
for (i in 1:quantity) {
print(paste0("The ",i," random point"))
print(Points[i,])
print("The following two results should be the same")
print("Result from the function 'dmvnorm' out of package 'mvtnorm'")
print(dmvnorm(Points[i,],mean=mu,sigma=sigma,log=FALSE))
print("Result from equation out of wikipedia")
SS <- sqrt(S)
print(1/(2*pi*SS[1,1]*SS[2,2]*(1-rho^2)^(1/2))*exp((-1)/(2*(1-rho^2))*(Points[i,1]^2/SS[1,1]^2+Points[i,2]^2/SS[2,2]^2-(2*rho*Points[i,1]*Points[i,2])/(SS[1,1]*SS[2,2]))))
print("----")
print("----")
} # end for-loop
} # end function
# execute the function and compare the results
Comparison(Points=x,mean=mu,sigma=S,quantity=4)
So the comment where you define sigma is not correct: judging by how you construct S, sigma holds the variances in your code, not the standard deviations.
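Put differently, if sigma is really meant to hold standard deviations, a minimal sketch of the alternative fix (keeping your matrix() construction) is to square them when building the covariance matrix:

sigma <- c(2, 1)                                                      # intended as standard deviations
S <- matrix(c(sigma[1]^2, 0, 0, sigma[2]^2), ncol = 2, byrow = TRUE)  # covariance matrix: variances on the diagonal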
First of all, I thank you all beforehand for reading this.
I am trying to fit a standardized Student's t distribution (i.e. a Student's t with standard deviation = 1) to a series of data; that is, I want to estimate the degrees of freedom via maximum likelihood estimation.
An example of what I need to achieve can be found in the following (simple) Excel file I made:
https://www.dropbox.com/s/6wv6egzurxh4zap/Excel%20Implementation%20Example.xlsx?dl=0
Inside the Excel file, I have an image that contains the formula for the log-likelihood function of the standardized Student's t distribution. The formula was taken from a finance book (Elements of Financial Risk Management, by Peter Christoffersen).
So far, I have tried this with R:
copula.data <- read.csv(file.choose(),header = TRUE)
z1 <- copula.data[,1]
library(fitdistrplus)
ft1 = fitdist(z1, "t", method = "mle", start = list(df = 10))
df1=ft1$estimate[1]
df1
logLik(ft1)
df1 yields the number: 13.11855278779897
logLik(ft1) yields the number: -3600.2918050056487
However, the Excel file yields degrees of freedom of: 8.2962365022727, and a log-likelihood of: -3588.8879 (which is the right answer).
Note: the .csv file that my code reads is the following:
https://www.dropbox.com/s/nnh2jgq4fl6cm12/Data%20for%20T%20Copula.csv?dl=0
Any ideas? Thank you people!
The formula from your spreadsheet (with n, x substituted for the df parameter and the data)
=GAMMALN((n+1)/2)-GAMMALN(n/2)-LN(PI())/2-LN(n-2)/2-1/2*(1+n)*LN(1+x^2/(n-2))
or, exponentiating,
Gamma((n+1)/2) / (sqrt((n-2) pi) Gamma(n/2)) (1+x^2/(n-2))^-((n+1)/2)
?dt gives
f(x) = Gamma((n+1)/2) / (sqrt(n pi) Gamma(n/2)) (1 + x^2/n)^-((n+1)/2)
So the difference lies in those n-2 values in two places in the formula. I don't have enough context to see why the author is defining the t distribution in that different way; there may be some good reason ...
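For what it's worth, here is a sketch of the book's log-density transcribed directly from the spreadsheet formula above (dstdt_log and nll_std are my own helper names); maximising this likelihood rather than dt()'s should land near the spreadsheet's estimate, assuming z1 holds the same data as the Excel file:

dstdt_log <- function(x, df) {
  # log-density transcribed from the spreadsheet formula, with df playing the role of n
  lgamma((df + 1) / 2) - lgamma(df / 2) - log(pi) / 2 - log(df - 2) / 2 -
    (df + 1) / 2 * log(1 + x^2 / (df - 2))
}
nll_std <- function(df) -sum(dstdt_log(z1, df))  # negative log-likelihood under the book's density
optimize(nll_std, interval = c(2.1, 50))         # df must exceed 2 for the formula to make sense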
Looking at the negative log-likelihood curve directly, it certainly seems as though the fitdistrplus answer agrees with the direct calculation. (It would be very surprising if there were a bug in the dt() function; R's distribution functions are very widely used and thoroughly tested.)
LL <- function(p,data=z1) {
-sum(dt(data,df=p,log=TRUE))
}
pvec <- seq(6,20,by=0.05)
Lvec <- sapply(pvec,LL)
par(las=1,bty="l")
plot(pvec,Lvec,type="l",
xlab="df parameter",ylab="negative log-likelihood")
## superimpose fitdistr results ...
abline(v=coef(ft1),lty=2)
abline(h=-logLik(ft1),lty=2)
Unless there's something else you're not telling us about the problem definition, it seems to me that R is getting the right answer. (The mean and sd of the data you gave were not exactly equal to 0 and 1 respectively, but they were close; centering and scaling gave an even larger value for the parameter.)
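For reference, a sketch of the centering and scaling mentioned above (scale() returns a matrix, hence the as.numeric()):

z1s <- as.numeric(scale(z1))                                       # center to mean 0, rescale to sd 1
ft1s <- fitdist(z1s, "t", method = "mle", start = list(df = 10))   # refit on the standardised data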
I am using the OptimalCutpoints package in R to find the optimal cutoff point from ROC curve. The criterion for finding the optimal threshold is maximizing Youden's index:
J = sensitivity + specificity - 1
I am trying to do the same in MATLAB with the function perfcurve. I run perfcurve with the default criteria for the two axes, FPR on the x-axis and TPR on the y-axis. perfcurve returns an array of thresholds and chooses one of them according to its criterion.
The problem is that the optimal threshold MATLAB gives is not the same as the one from R. However, the optimal threshold according to R is included in the threshold array that MATLAB returns.
How can I replicate in MATLAB the results that R returns? I suspect that the criterion for Youden's index is not set correctly in MATLAB.
If you look at the documentation for perfcurve (specifically the OPTROCPT entry), you will see that the formula MATLAB uses to find the best threshold is quite different and includes a cost matrix in the optimality criterion.
If you want to replicate what is done in R exactly, use the X and Y return values to compute the Youden index for each threshold, and then choose the best one (see "how to find max and it's index in array in matlab" for some idea of how to do it).
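If it helps, here is a minimal sketch of that selection rule written in R, since that is what the question compares against (best_youden is a hypothetical helper; fpr, tpr and thresholds stand for perfcurve's X, Y and T outputs and are assumed to be equal-length vectors):

best_youden <- function(fpr, tpr, thresholds) {
  j <- tpr - fpr             # Youden's J = sensitivity + specificity - 1, since specificity = 1 - FPR
  thresholds[which.max(j)]   # the cutoff maximising J, which should match OptimalCutpoints
}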