How should I treat very small p-values in R?
I am expecting very low p-values, like:
1.00E-80
I need to -log10
-log10(1.00E-80)
-log10(0) is Inf, but that Inf is only an artifact of the p-value having been rounded down to 0.
It seems that beyond about 1e-308, R yields 0:
1/10^308
[1] 1e-308
1/10^309
[1] 0
Is the precision of the p-values that lm can report limited by this same cutoff, 1e-308? Or is it simply that some cutoff is unavoidable, so I should pick my own -- say 1e-100 -- and replace any 0 with "<1e-100"?
There are a variety of possible answers -- which one is most useful depends on the context:
R is indeed incapable under ordinary circumstances of storing floating-point values closer to zero than .Machine$double.xmin, which varies by platform but is typically (as you discovered) on the order of 1e-308. If you really need to work with numbers this small and can't find a way to work on the log scale directly, you need to search Stack Overflow or the R wiki for methods for dealing with arbitrary/extended-precision values (but you should probably try to work on the log scale -- it will be much less of a hassle).
in many circumstances R actually computes p-values on the (natural) log scale internally, and can, if requested, return the log values rather than exponentiating them before giving the answer. For example, dnorm(-100, log=TRUE) gives -5000.919. You can convert directly to the log10 scale (without exponentiating and then taking log10) by dividing by log(10): dnorm(-100, log=TRUE)/log(10) is about -2171.9, i.e. a value of roughly 1e-2172, which would be far too small to represent as an ordinary double. For the p*** (cumulative distribution function) functions, use log.p=TRUE rather than log=TRUE. (This particular point depends heavily on your context; even if you are not using built-in R functions, you may be able to find a way to extract results on the log scale.)
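For a concrete base-R illustration of the log.p=TRUE route (standard normal tail areas, nothing specific to your data): the ordinary call underflows once the tail is extreme enough, but the log-scale call still gives a usable log10 p-value.
pnorm(20, lower.tail=FALSE)                      # about 2.8e-89, still representable
pnorm(40, lower.tail=FALSE)                      # underflows to 0
pnorm(40, lower.tail=FALSE, log.p=TRUE)/log(10)  # about -349.4, i.e. p ~ 1e-349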
in some cases R presents p-value results as being <2.2e-16 even when a more precise value is known: (t1 <- t.test(rnorm(10,100),rnorm(10,80)))
prints
....
t = 56.2902, df = 17.904, p-value < 2.2e-16
but you can still extract the precise p-value from the result
> t1$p.value
[1] 1.856174e-18
(in many cases this behaviour is controlled by the format.pval() function)
An illustration of how all this would work with lm:
d <- data.frame(x=rep(1:5,each=10))
set.seed(101)
d$y <- rnorm(50,mean=d$x,sd=0.0001)
lm1 <- lm(y~x,data=d)
summary(lm1) prints the p-value of the slope as <2.2e-16, but if we use coef(summary(lm1)) (which does not use the p-value formatting), we can see that the value is 9.690173e-203.
A more extreme case:
set.seed(101); d$y <- rnorm(50,mean=d$x,sd=1e-7)
lm2 <- lm(y~x,data=d)
coef(summary(lm2))
shows that the p-value has actually underflowed to zero. However, we can still get an answer on the log scale:
tval <- coef(summary(lm2))["x","t value"]
log10(2) + pt(abs(tval), df=48, lower.tail=FALSE, log.p=TRUE)/log(10)
gives about -346, i.e. a two-tailed p-value of roughly 1e-346. (On the log scale the factor of 2 for the two tails becomes an additive log(2); multiplying the whole log expression by 2 would instead square the one-tailed p-value.) You can check this approach with the previous example, where the p-value doesn't underflow, and see that you get the same answer as printed in the summary.
Small numbers are generally hard to deal with.
The limit you are running into is caused by R's use of double-precision floating point:
From ?double: "All R platforms are required to work with values conforming to the IEC 60559 (also known as IEEE 754) standard. This basically works with a precision of 53 bits, and represents to that precision a range of absolute values from about 2e-308 to 2e+308."
http://en.wikipedia.org/wiki/Double_precision_floating-point_format
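You can inspect the limits on your own platform directly via the standard .Machine constants in base R:
.Machine$double.xmin   # smallest normalized positive double, about 2.2e-308
.Machine$double.xmax   # largest finite double, about 1.8e+308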
You may find the Rmpfr package helpful here as it allows you to create multiple precision numbers.
install.packages("Rmpfr")
require(Rmpfr)
# note: the tiny number must be constructed from mpfr objects; plain 1/10^309
# underflows to 0 in double precision before Rmpfr ever sees it
log(1/mpfr(10, precBits=500)^309)
Perhaps this is a philosophical question rather than a programming question, but here goes...
In R, is there some package or method that will let you deal with "less than"s as a concept?
Backstory: I have some data which, for privacy reasons, is given as <5 for small numbers (representing integers 1, 2, 3 or 4, in fact). I'd like to do some simple arithmetic on this data (adding, subtracting, averaging, etc.) but obviously I need to find some way to deal with these <5s conceptually. I could replace them all with NAs, sure, but of course that's throwing away potentially useful information, and I would like to avoid that if possible.
Some examples of what I mean:
a <- c(2,3,8)
b <- c(<5,<5,8)   # not real R syntax, just what the data look like
mean(a)
# 4.3333
mean(b)
# somewhere between 3.3333 and 5.3333, depending on what the <5s really are
If you are interested in the values at the bounds, I would take each dataset and split it into two datasets; one with all <5s set to 1 and one with all <5s set to 4.
a <- c(2,3,8)
b1 <- c(1,1,8)
b2 <- c(4,4,8)
mean(a)
# 4.333333
mean(b1)
# 3.3333
mean(b2)
# 5.3333
Following @hedgedandlevered's proposal, but note that he's wrong about the normal and/or uniform part: you are asking about integer values, so you have to use discrete distributions, like the Poisson, the binomial (including the negative binomial), the geometric, etc.
In statistics "less than" data is known as "left censored" https://en.wikipedia.org/wiki/Censoring_(statistics), searching on "censored data" might help.
My favoured approach to analysing such data is maximum likelihood https://en.wikipedia.org/wiki/Maximum_likelihood. There are a number of R packages for maximum likelihood estimation, I like the survival package https://cran.r-project.org/web/packages/survival/index.html but there are others, e.g. fitdistrplus https://cran.r-project.org/web/packages/fitdistrplus/index.html which "provides functions for fitting univariate distributions to different types of data (continuous censored or non-censored data and discrete data) and allowing different estimation methods (maximum likelihood, moment matching, quantile matching and maximum goodness-of-fit estimation)".
You will have to specify (assume?) the form of the distribution of the data; you say it is integer, so maybe a Poisson or a related discrete distribution would be appropriate.
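For concreteness, here is a minimal hand-rolled maximum-likelihood sketch in base R. It assumes (purely for illustration) that the counts are Poisson and that "<5" means an integer between 1 and 4; the data are made up:
exact <- c(6, 12, 18)                  # fully observed values
n_cens <- 2                            # number of "<5" observations
nll <- function(lambda) {
  # each censored observation contributes P(1 <= X <= 4) to the likelihood
  -(sum(dpois(exact, lambda, log = TRUE)) +
      n_cens * log(ppois(4, lambda) - ppois(0, lambda)))
}
optimize(nll, interval = c(0.01, 50))$minimum   # ML estimate of the Poisson mean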
Treat them as coming from a probability distribution of your choosing, and replace them with actual randomly generated numbers. Setting them all equal to 2.5, drawing from a normal-like distribution truncated to [0, 5], or drawing uniformly on [0, 5] are all options.
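A quick sketch of that kind of random imputation, using a discrete uniform draw on 1-4 in line with the integer point above (the data are made up):
set.seed(42)
b <- c(NA, NA, 8)                      # NA marks the "<5" entries
b[is.na(b)] <- sample(1:4, sum(is.na(b)), replace = TRUE)
mean(b)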
I deal with similar data regularly. I strongly dislike any of the suggestions of replacing the <5 values with a particular number. Consider the following two cases:
c(<5,<5,<5,<5,<5,<5,<5,<5,6,12,18)
c(<5,6,12,18)
The problem comes when you try to do arithmetic with these.
I think a solution to your issue is to think of the values as factors (in the R sense). You can bin the values above 5 too if that helps, for example
c(<5,<5,<5,<5,<5,<5,<5,<5,5-9,10-14,15-19)
c(<5,5-9,10-14,15-19)
Now, you still wouldn't do arithmetic on these, but your summary statistics (histograms/proportion tables/etc...) would make more sense.
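A sketch of that binning in R (the raw values are hypothetical; the point is the "<5" level plus 5-wide bins):
raw <- c(rep("<5", 8), "6", "12", "18")
lvl <- c("<5", "5-9", "10-14", "15-19")
binned <- ifelse(raw == "<5", "<5",
                 as.character(cut(suppressWarnings(as.numeric(raw)),
                                  breaks = c(4, 9, 14, 19), labels = lvl[-1])))
binned <- factor(binned, levels = lvl)
table(binned)                          # summaries that don't pretend to know exact values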
I would like to use optimize(), or something similar, to search for a minimum / maximum value of a function. However, I am unsure about the exact range over which the function should be optimized, which is a required argument of optimize() (e.g. optimize(f=FUN, interval=c(lowerBound, upperBound))).
In this optimization problem, I am able to estimate a value a that is "close" to the optimal solution, but "closeness" depends on the situation.
Is there a function in R that can make use of an initial value a and does not require the optimization interval to be specified up front?
When you say you're not sure about the lower limit, I suspect that this means that the parameter you are trying to estimate is not bounded below.
If this is the case, one trick is to transform the function so that there is a lower bound on the parameter.
This trivial function has a minimum at x=4:
fun <- function(x) -exp(-(x - 4)^2) + 8
which we can find via:
optimize(f=fun,interval=c(0,8))
#> $minimum
#> [1] 4
but let's pretend for a moment that we're not sure if there is a lower limit or not, and that we know that the upper limit is 8. R will throw an error if we try:
optimize(f=fun,interval=c(-Inf,8))
because the bounds must be finite. In this case, we can use the exponential transformation (exp()) which maps
the real numbers to the positive numbers, like so:
optimize(f=function(x)fun(log(x)),
interval=exp(c(-Inf,8)))
#> $minimum
#> [1] 54.59815
and then to recover the location of the minimum on the original scale, you just need to back-transform the solution:
log(54.59815)
#> 4
If you don't know either the upper or lower bound on the underlying parameter, then you can use the log-odds transformation in place of the log():
function(x) log(x/(1-x))
and it's inverse in place of exp():
function(y) exp(y)/(1 + exp(y))
Note that the log-odds transformation maps the real numbers onto the unit interval, so the interval parameter becomes 0:1.
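(These two transformations are available in base R as qlogis() and plogis().) A quick sketch with the same toy function fun() from above:
optimize(f = function(p) fun(qlogis(p)), interval = c(0, 1))
# the minimum should be near plogis(4) = 0.982..., and qlogis() of that recovers x = 4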
These solutions do have some numerical limitations (e.g. if we had set interval=exp(c(-Inf,16)) in the first solution, we would have gotten an error). Tip: you can re-center these transformations around a given point a, which can reduce the numerical limitations.
I am working with financial/economic data, in case you are wondering about the large size of some of the coefficients below. My general question has to do with simulating parameter coefficients from a linear random-effects model in R: I am attempting to generate a random sample of beta coefficients using the model coefficients and the variance-covariance (VCOV) matrix from the same model. My question is: why am I receiving the warning below about the square root of the eigenvalues (sqrt(ev$values)) when using the rmvnorm() function from the mvtnorm package, and how can I deal with it?
#Example call: lmer model with random effects by YEAR
#mlm<-lmer(DV~V1+V2+V3+V2*V3+V4+V5+V6+V7+V8+V9+V10+V11+(1|YEAR), data=dat)
#Note: 5 years (5 random effects total)
#LMER call yields the following information:
coef<-as.matrix(c(-28037800,0.8368619,2816347,8681918,-414002.6,371010.7,-26580.84,80.17909,271.417,-239.1172,3.463785,-828326))
sigma<-as.matrix(rbind(c(1834279134971.21,-415.95,-114036304870.57,-162630699769.14,-23984428143.44,-94539802675.96,
-4666823087.67,-93751.98,1735816.34,-1592542.75,3618.67,14526547722.87),
c(-415.95,0.00,41.69,94.17,-8.94,-22.11,-0.55,0.00,0.00,0.00,0.00,-7.97),
c(-114036304870.57,41.69,12186704885.94,12656728536.44,-227877587.40,-2267464778.61,
-4318868.82,8909.65,-355608.46,338303.72,-321.78,-1393244913.64),
c(-162630699769.14,94.17,12656728536.44,33599776473.37,542843422.84,4678344700.91,-27441015.29,
12106.86,-225140.89,246828.39,-593.79,-2445378925.66),
c(-23984428143.44,-8.94,-227877587.40,542843422.84,32114305557.09,-624207176.98,-23072090.09,
2051.16,51800.37,-49815.41,-163.76,2452174.23),
c(-94539802675.96,-22.11,-2267464778.61,4678344700.91,-624207176.98,603769409172.72,90275299.55,
9267.90,208538.76,-209180.69,-304.18,-7519167.05),
c(-4666823087.67,-0.55,-4318868.82,-27441015.29,-23072090.09,90275299.55,82486186.42,-100.73,
15112.56,-15119.40,-1.34,-2476672.62),
c(-93751.98,0.00,8909.65,12106.86,2051.16,9267.90,-100.73,2.54,8.73,-10.15,-0.01,-1507.62),
c(1735816.34,0.00,-355608.46,-225140.89,51800.37,208538.76,15112.56,8.73,527.85,-535.53,-0.01,21968.29),
c(-1592542.75,0.00,338303.72,246828.39,-49815.41,-209180.69,-15119.40,-10.15,-535.53,545.26,0.01,-23262.72),
c(3618.67,0.00,-321.78,-593.79,-163.76,-304.18,-1.34,-0.01,-0.01,0.01,0.01,42.90),
c(14526547722.87,-7.97,-1393244913.64,-2445378925.66,2452174.23,-7519167.05,-2476672.62,-1507.62,21968.29,
-23262.72,42.90,229188496.83)))
#Warning begins here:
betas<-rmvnorm(n=1000, mean=coef, sigma=sigma)
#rmvnorm returns NaNs, along with:
Warning message: In sqrt(ev$values) : NaNs produced
When I Google the search string "rmvnorm Warning message: In sqrt(ev$values) : NaNs produced", I find:
http://www.nickfieller.staff.shef.ac.uk/sheff-only/mvatasksols6-9.pdf, which says on page 4 that this warning indicates "negative eigen values". However, I have no idea, conceptually or practically, what a negative eigenvalue is or why one would be produced in this instance.
The second search result (http://www.r-tutor.com/r-introduction/basic-data-types/complex2) indicates that the warning arises from an attempt to take the square root of a negative number, which produces NaN unless you work with complex values.
The question remains, what is going on here with the random generation of the betas, and how can this be corrected?
sessionInfo()
# R version 3.0.2 (2013-09-25)
# Platform: x86_64-apple-darwin10.8.0 (64-bit)
Using the following packages/versions:
mvtnorm_0.9-9994, lme4_1.1-5, Rcpp_0.10.3, Matrix_1.1-2-2, lattice_0.20-23
You have a huge range of scales in your eigenvalues:
range(eigen(sigma)$values)
## [1] -1.005407e-05 1.863477e+12
I prefer to use mvrnorm from the MASS package, just because it comes installed automatically with R. It also appears to be more robust:
set.seed(1001)
m <- MASS::mvrnorm(n=1000, mu=coef, Sigma=sigma) ## works fine
edit: OP points out that using method="svd" with rmvnorm also works.
If you print the code for MASS::mvrnorm, or debug(MASS::mvrnorm) and step through it, you see that it uses
if (!all(ev >= -tol * abs(ev[1L]))) stop("'Sigma' is not positive definite")
(where ev is the vector of eigenvalues, in decreasing order, so ev[1] is the largest eigenvalue) to decide on the positive definiteness of the variance-covariance matrix. In this case ev[1L] is about 2e12, tol is 1e-6, so this would allow negative eigenvalues up to a magnitude of about 2e6. In this case the minimum eigenvalue is -1e-5, well within tolerance.
Farther down MASS::mvrnorm uses pmax(ev,0) -- that is, if it has decided that the eigenvalues are not below tolerance (i.e. it didn't fail the test above), it just truncates the negative values to zero, which should be fine for practical purposes.
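If you want to do that truncation by hand, here is a sketch (reusing the coef and sigma objects from the question) that mirrors what MASS::mvrnorm does internally:
ee <- eigen(sigma, symmetric = TRUE)
ev <- pmax(ee$values, 0)                          # clamp round-off negatives to zero
A  <- ee$vectors %*% diag(sqrt(ev))               # a 'square root' of sigma
set.seed(1001)
betas <- matrix(rnorm(1000 * length(coef)), 1000) %*% t(A) +
         matrix(as.numeric(coef), 1000, length(coef), byrow = TRUE)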
If you insisted on using rmvnorm you could use Matrix::nearPD, which tries to force the matrix to be positive definite -- it returns a list which contains (among other things) the eigenvalues and the "positive-definite-ified" matrix:
m <- Matrix::nearPD(sigma)
range(m$eigenvalues)
## [1] 1.863477e+04 1.863477e+12
The eigenvalues computed from the matrix are not quite identical -- nearPD and eigen use slightly different algorithms -- but they're very close.
range(eigen(m$mat)$values)
## [1] 1.861280e+04 1.863477e+12
More generally,
Part of the reason for the huge range of eigenvalues might be predictor variables that are scaled very differently. It might be a good idea to scale your input data if possible to make the variances more similar to each other (this will also make all of your numerical computations more stable) -- you can always rescale the values once you've generated them.
It's also the case that when matrices are very close to singular (i.e. some eigenvalues are very close to zero), small numerical differences can change the sign of the eigenvalues. In particular, if you copy and paste the values, you might lose some precision and cause this problem. Using dput(vcov(fit)) or saveRDS(vcov(fit), "vcov.rds") to save the variance-covariance matrix at full precision is safer.
If you have no idea what "positive definite" means, you might want to read up about it. The Wikipedia articles on covariance matrices and positive definite matrices might be a little too technical for you to start with; this question on StackExchange is closer, but still a little technical. The next entry on my Google journey was this one, which looks about right.
I want to generate a scaled inverse chi-squared distribution in R. I know geoR has an R function for generating this, but I want to use the gamma distribution to do it.
I think these two are equivalent:
X ~ rinvchisq(100, df=d, scale=s)
1/X ~ rgamma(100, shape=d/2, scale=2/(d*s))
isn't it? Can this cause any numerical problems due to extreme values?
More specifically you would need X <- rinvchisq(...) and X <- 1/rgamma(...) (the ~ notation works this way in programs such as WinBUGS, and in statistics notation, but not in R). If you look at the code of geoR::rinvchisq, the relevant part is just
return((df * scale)/rchisq(n, df = df))
so if you have problems taking the reciprocal of very large or small chi-squared deviates you'll be in trouble anyway (although rchisq is internally using .External(C_rchisq, n, df), which falls through to C code, presumably for efficiency in this special case, rather than calling rgamma). If I were you I would go ahead and superimpose densities of some test samples just to make sure I hadn't screwed up the arithmetic or parameterization somewhere ...
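For example, a quick check of the claimed equivalence using only base R (arbitrary df and scale; the two samples should have essentially the same distribution):
set.seed(1)
d <- 5; s <- 2; n <- 1e5
x1 <- (d * s) / rchisq(n, df = d)                        # geoR::rinvchisq's recipe
x2 <- 1 / rgamma(n, shape = d / 2, scale = 2 / (d * s))  # the gamma route
qqplot(x1, x2); abline(0, 1)                             # points should hug the line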
For what it's worth there are also rinvgamma() functions in a variety of packages (library(sos); findFn("rinvgamma"))
How to generate a pseudo-random positive definite matrix with constraints on the off-diagonal elements?
The user wants to impose a unique, non-trivial upper/lower bound on the correlation between every pair of variables in a variance/covariance matrix.
For example: I want a variance matrix in which all variables have 0.9 > |rho(x_i,x_j)| > 0.6, rho(x_i,x_j) being the correlation between variables x_i and x_j.
Thanks.
There are MANY issues here.
First of all, are the pseudo-random deviates assumed to be normally distributed? I'll assume they are, as any discussion of correlation matrices gets nasty if we diverge into non-normal distributions.
Next, it is rather simple to generate pseudo-random normal deviates, given a covariance matrix. Generate standard normal (independent) deviates, and then transform by multiplying by the Cholesky factor of the covariance matrix. Add in the mean at the end if the mean was not zero.
And, a covariance matrix is also rather simple to generate given a correlation matrix. Just pre and post multiply the correlation matrix by a diagonal matrix composed of the standard deviations. This scales a correlation matrix into a covariance matrix.
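In R, those two steps might look like this (the correlation matrix and standard deviations below are arbitrary illustrations):
R_target <- matrix(c(1, 0.70, 0.65,
                     0.70, 1, 0.80,
                     0.65, 0.80, 1), 3, 3)   # a valid correlation matrix
sds <- c(2, 5, 1)                            # target standard deviations
S <- diag(sds) %*% R_target %*% diag(sds)    # scale into a covariance matrix
Z <- matrix(rnorm(1000 * 3), 1000, 3)        # independent standard normals
X <- Z %*% chol(S)                           # deviates with covariance approx. S
round(cor(X), 2)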
I'm still not sure where the problem lies in this question, since it would seem easy enough to generate a "random" correlation matrix, with elements uniformly distributed in the desired range.
So all of the above is rather trivial by any reasonable standards, and there are many tools out there to generate pseudo-random normal deviates given the above information.
Perhaps the issue is the user insists that the resulting random matrix of deviates must have correlations in the specified range. You must recognize that a set of random numbers will only have the desired distribution parameters in an asymptotic sense. Thus, as the sample size goes to infinity, you should expect to see the specified distribution parameters. But any small sample set will not necessarily have the desired parameters, in the desired ranges.
For example, (in MATLAB) here is a simple positive definite 3x3 matrix. As such, it makes a very nice covariance matrix.
S = randn(3);
S = S'*S
S =
0.78863 0.01123 -0.27879
0.01123 4.9316 3.5732
-0.27879 3.5732 2.7872
I'll convert S into a correlation matrix.
s = sqrt(diag(S));
C = diag(1./s)*S*diag(1./s)
C =
1 0.0056945 -0.18804
0.0056945 1 0.96377
-0.18804 0.96377 1
Now, I can sample from a normal distribution using the statistics toolbox (mvnrnd should do the trick.) As easy is to use a Cholesky factor.
L = chol(S)
L =
0.88805 0.012646 -0.31394
0 2.2207 1.6108
0 0 0.30643
Now, generate pseudo-random deviates, then transform them as desired.
X = randn(20,3)*L;
cov(X)
ans =
0.79069 -0.14297 -0.45032
-0.14297 6.0607 4.5459
-0.45032 4.5459 3.6549
corr(X)
ans =
1 -0.06531 -0.2649
-0.06531 1 0.96587
-0.2649 0.96587 1
If your desire was that the correlations must ALWAYS be greater than -0.188, then this sampling technique has failed, since the numbers are pseudo-random. In fact, that goal will be a difficult one to achieve unless your sample size is large enough.
You might employ a simple rejection scheme, whereby you do the sampling, then redo it repeatedly until the sample has the desired properties, with the correlations in the desired ranges. This may get tiring.
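A minimal rejection-scheme sketch in R (purely illustrative; it builds a target covariance like the one above and simply resamples until every sample correlation magnitude lands between the 0.6 and 0.9 bounds from the question):
set.seed(1)
S <- diag(c(2, 5, 1)) %*%
     matrix(c(1, 0.70, 0.65, 0.70, 1, 0.80, 0.65, 0.80, 1), 3, 3) %*%
     diag(c(2, 5, 1))
for (i in 1:10000) {
  X <- matrix(rnorm(50 * 3), 50, 3) %*% chol(S)
  r <- abs(cor(X)[upper.tri(diag(3))])
  if (all(r > 0.6 & r < 0.9)) break          # keep the first sample that qualifies
}
round(cor(X), 2)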
An approach that might work (but one that I've not totally thought out at this point) is to use the standard scheme as above to generate a random sample. Compute the correlations. If they fail to lie in the proper ranges, then identify the perturbation one would need to make to the actual (measured) covariance matrix of your data, so that the correlations would be as desired. Now, find a zero-mean random perturbation to your sampled data that would move the sample covariance matrix in the desired direction.
This might work, but unless I knew that this is actually the question at hand, I won't bother to go any more deeply into it. (Edit: I've thought some more about this problem, and it appears to be a quadratic programming problem, with quadratic constraints, to find the smallest perturbation to a matrix X, such that the resulting covariance (or correlation) matrix has the desired properties.)
This is not a complete answer, but a suggestion of a possible constructive method:
Looking at the characterizations of the positive definite matrices (http://en.wikipedia.org/wiki/Positive-definite_matrix) I think one of the most affordable approaches could be using the Sylvester criterion.
You can start with a trivial 1x1 random matrix with positive determinant and expand it in one row and column step by step while ensuring that the new matrix has also a positive determinant (how to achieve that is up to you ^_^).
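A rough sketch of that grow-and-check idea in R (purely illustrative: positive correlations only, the 0.6-0.9 bounds from the question, and a resampling cap, since the extension step can become slow or fail as the dimension grows):
grow_corr <- function(p, lo = 0.6, hi = 0.9, max_tries = 10000) {
  C <- matrix(1, 1, 1)
  while (nrow(C) < p) {
    k <- nrow(C) + 1
    ok <- FALSE
    for (i in seq_len(max_tries)) {
      r <- runif(k - 1, lo, hi)                    # candidate new correlations
      cand <- unname(rbind(cbind(C, r), c(r, 1)))
      if (det(cand) > 0) {                         # Sylvester: keep every leading minor positive
        C <- cand; ok <- TRUE; break
      }
    }
    if (!ok) stop("could not extend to dimension ", k)
  }
  C
}
set.seed(1)
grow_corr(4)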
Woodship,
"First of all, are the pseudo-random deviates assumed to be normally distributed?"
yes.
"Perhaps the issue is the user insists that the resulting random matrix of deviates must have correlations in the specified range."
Yes, that's the whole difficulty
"You must recognize that a set of random numbers will only have the desired distribution parameters in an asymptotic sense."
True, but this is not the problem here: your strategy works for p=2, but fails for p>2, regardless of sample size.
"If your desire was that the correlations must ALWAYS be greater than -0.188, then this sampling technique has failed, since the numbers are pseudo-random. In fact, that goal will be a difficult one to achieve unless your sample size is large enough."
It is not a sample size issue, because with p>2 you do not even observe convergence to the right range for the correlations as the sample size grows: I tried the technique you suggest before posting here, and it is clearly flawed.
"You might employ a simple rejection scheme, whereby you do the sampling, then redo it repeatedly until the sample has the desired properties, with the correlations in the desired ranges. This may get tiring."
Not an option: for large p (say larger than 10) this approach is intractable.
"Compute the correlations. I they fail to lie in the proper ranges, then identify the perturbation one would need to make to the actual (measured) covariance matrix of your data, so that the correlations would be as desired."
Ditto
As for the QP, I understand the constraints, but I'm not sure about the way you define the objective function; by using the "smallest perturbation" of some initial matrix, you will always end up with the same (solution) matrix: all the off-diagonal entries will be exactly equal to one of the two bounds (i.e. not pseudo-random); plus it is kind of overkill, isn't it?
Come on people, there must be something simpler