R small pvalues - r

I am calculating z-scores to see whether a value is far from the mean/median of the distribution.
I originally did this using the mean, then converted the z-scores into two-sided p-values. Now that I am using the median, I noticed that there are some NAs in the p-values.
I determined that this occurs for values that are very far from the median,
and it looks to be related to the pnorm calculation.
"
'qnorm' is based on Wichura's algorithm AS 241 which provides
precise results up to about 16 digits. "
Does anyone know a way around this? I would like to get the very small p-values.
Thanks,
> z<- -12.5
> 2-2*pnorm(abs(z))
[1] 0
> z<- -10
> 2-2*pnorm(abs(z))
[1] 0
> z<- -8
> 2-2*pnorm(abs(z))
[1] 1.332268e-15

Intermediately, you are actually calculating very high p-values:
options(digits=22)
z <- c(-12.5,-10,-8)
pnorm(abs(z))
# [1] 1.0000000000000000000000 1.0000000000000000000000 0.9999999999999993338662
2-2*pnorm(abs(z))
# [1] 0.000000000000000000000e+00 0.000000000000000000000e+00 1.332267629550187848508e-15
I think you will be better off using the low p-values (close to zero) but I am not good enough at math to know whether the error at close-to-one p-values is in the AS241 algorithm or the floating point storage. Look how nicely the low values show up:
pnorm(z)
# [1] 3.732564298877713761239e-36 7.619853024160526919908e-24 6.220960574271784860433e-16
Keep in mind that 1 - pnorm(x) is equivalent to pnorm(-x). So 2-2*pnorm(abs(x)) is equivalent to 2*(1 - pnorm(abs(x))), which is equivalent to 2*pnorm(-abs(x)), so just go with:
2 * pnorm(-abs(z))
# [1] 7.465128597755427522478e-36 1.523970604832105383982e-23 1.244192114854356972087e-15
which should get more precisely what you are looking for.

One thought: you might be able to use log(p) to get more precision in the tails, although converting back would need an exp() with larger precision. Otherwise the non-log p-values are effectively 0 once they fall outside the range that can be represented:
> z<- -10
> pnorm(abs(z),log.p=TRUE)
[1] -7.619853e-24
Converting back to the p-value doesn't work well, but you could compare on log(p)...
> exp(pnorm(abs(z),log.p=TRUE))
[1] 1
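A minimal sketch of comparing on the log scale, assuming the quantity of interest is the two-sided p-value from the question. It takes the log of the small lower tail rather than of the upper probability, so the result stays finite even where the plain p-value underflows to 0. The value z = -40 is an extra illustrative input, not one from the question.

```r
z <- c(-12.5, -10, -8, -40)

# Two-sided p-value on the natural scale; underflows to 0 for extreme z
p_two_sided <- 2 * pnorm(-abs(z))

# Same quantity on the log scale: log(2) + log(P(Z <= -|z|))
log_p <- log(2) + pnorm(-abs(z), log.p = TRUE)

# Base-10 exponent, convenient for reporting very small p-values
log10_p <- log_p / log(10)

cbind(z, p_two_sided, log_p, log10_p)
```

Ranking or thresholding can then be done on log_p directly, without ever converting back with exp().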

pnorm is a function that gives the cumulative probability (P value) for a given x. If you do not specify further arguments, the default distribution is Normal with mean 0 and standard deviation 1.
By symmetry, pnorm(a) = 1-pnorm(-a).
In R, probabilities very close to 1 are rounded to exactly 1, because doubles near 1 are spaced about 1e-16 apart, whereas probabilities close to 0 keep their precision down to about 1e-308. So using this formula with negative arguments you can calculate the needed values.
> pnorm(0.25)
[1] 0.5987063
> 1-pnorm(-0.25)
[1] 0.5987063
> pnorm(20)
[1] 1
> pnorm(-20)
[1] 2.753624e-89
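The rounding behaviour can be checked directly. This is a small sketch, assuming the usual IEEE 754 double precision that R uses on common platforms; the variable names are only illustrative.

```r
# Spacing of doubles just below 1 (about 1.1e-16 on IEEE 754 platforms)
.Machine$double.neg.eps

# A difference smaller than that spacing is lost when stored next to 1
near_one <- 1 - 1e-17
near_one == 1            # TRUE: the 1e-17 has been rounded away

# The same magnitude is stored without trouble when it stands alone near 0
near_zero <- 1e-17
near_zero > 0            # TRUE

# So the upper tail via subtraction loses everything, the direct lower tail does not
upper_by_subtraction <- 1 - pnorm(20)
lower_direct <- pnorm(-20)
c(upper_by_subtraction, lower_direct)
```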

Related

Create log spaced numbers in R with high upper bound

I want to create a set of 10 logspaced numbers from zero to some big number M, say M=60,000, for example in R.
First, I tried to use lseq() from the package emdbook. The problem with lseq, however, is that it cannot handle 0 as a starting point. (This is due to the fact that it will try to calculate log(0) and then fail).
Next, I tried to use logspace() from the pracma package in the following way:
Numbers <- log(logspace(0,M,10),base=10)
This works fine for values of M up to about 340. From then on the numbers in the set will become infinity because the exponential function becomes too large.
Is there any other way in R to create a set of logspaced numbers from zero to some big number M which will not make most of the numbers in the set infinity and which can actually handle zero as a starting point?
Correct me if I am wrong, but can't you just calculate the logspaced values for lower numbers and then multiply? They should be linearly related, right? Just look at this output:
library(pracma)
> log(logspace(0,60, 10), base = 10)[1:5]
[1] 0.000000000000000 6.666666666666667 13.333333333333334 20.000000000000000 26.666666666666668
> log(logspace(0,600, 10), base = 10)[1:5]
[1] 0.000000000000000 66.666666666666671 133.333333333333343 200.000000000000000 266.666666666666686
> x1 <- (log(logspace(0,600, 10), base = 10)*100)[2]
> x1
[1] 6666.666666666667
> x2 <- seq(0 , 9, 1)*x1
> x2
[1] 0.000000000000 6666.666666666667 13333.333333333334 20000.000000000000 26666.666666666668
[6] 33333.333333333336 40000.000000000000 46666.666666666672 53333.333333333336 60000.000000000000
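Since log10(10^e) is just e, the base-10 logs of the logspaced numbers are simply evenly spaced exponents, so the detour through logspace() can probably be skipped. A minimal sketch, assuming that the evenly spaced values from 0 to M (as in the output above) are what is wanted:

```r
M <- 60000

# Ten evenly spaced values from 0 to M; no exponentials are evaluated,
# so nothing can overflow to Inf
numbers <- seq(0, M, length.out = 10)
numbers
```

This uses only base R and handles 0 as the starting point without any special casing.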

What are the results in the dt function?

Can someone explain the results of a typical dt call? The help page says that I should receive the density function. However, in my code below, what does the first value, 0.2067, represent? The second value?
x<-seq(1,10)
dt(x, df=3)
[1] 0.2067483358 0.0675096607 0.0229720373 0.0091633611 0.0042193538 0.0021748674
[7] 0.0012233629 0.0007369065 0.0004688171 0.0003118082
Two things were confused here:
dt gives you the density, this is why it decreases for large numbers:
x<-seq(1,10)
dt(x, df=3)
[1] 0.2067483358 0.0675096607 0.0229720373 0.0091633611 0.0042193538 0.0021748674
[7] 0.0012233629 0.0007369065 0.0004688171 0.0003118082
pt gives the distribution function. This is the probability of being less than or equal to x.
This is why the values go to 1 as x increases:
pt(x, df=3)
[1] 0.8044989 0.9303370 0.9711656 0.9859958 0.9923038 0.9953636 0.9970069 0.9979617 0.9985521 0.9989358
A "probability density" is not really a true probability, since probabilities are bounded in [0,1] while densities are not. The integral of a density across its domain is normalized to exactly 1. So densities are really the first derivatives of the cumulative distribution function. This code may help:
plot( x= seq(-10,10,length=100),
y=dt( seq(-10,10,length=100), df=3) )
The value of 0.207 for dt at x=1 says that at x=1 the cumulative probability is increasing at a rate of 0.207 per unit increase in x. (And since the t-distribution is symmetric, that is also the value of dt with 3 df at -1.)
A bit of coding to instantiate the dt(x,df=3) function (see ?dt) and then integrate it:
> dt3 <- function(x) { gamma((4)/2)/(sqrt(3*pi)*gamma(3/2))*(1+x^2/3)^-((3+1)/2) }
> dt3(1)
[1] 0.2067483
> integrate(dt3, -Inf, Inf)
1 with absolute error < 7.2e-08
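The "density is the derivative of the distribution function" reading can also be checked numerically. This is a rough sketch using a central finite difference; the step size h is an arbitrary choice.

```r
# Central finite difference of the distribution function pt at x = 1
h <- 1e-5
slope <- (pt(1 + h, df = 3) - pt(1 - h, df = 3)) / (2 * h)

# Density reported by dt at the same point
dens <- dt(1, df = 3)

# The two numbers should agree to several decimal places
c(slope = slope, density = dens)
```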

Solving equations in R similar to the Excel solver parameters function

I have a question about solving equations in R the way it can be done in Excel.
I want to do it with R to show my colleagues that R is better :)
Here is the equation:
f0<-1e-9
t_pw<-30e-9
a<-30.7397582453682
c<-6.60935546184612
P<-1-exp((-t_pw)*f0*exp(-a*(1-b/c)^2))
I want to find the b value for which P = 0.5. In Excel we can do it by selecting the P value cell, setting it to 0.5 and then using the Solver Parameters function.
Which method is best for this in R? Or is there any other way to do it?
Thankx.
I have a strong suspicion that your equation was supposed to include -t_pw/f0, not -t_pw*f0, and that t_pw was supposed to be 3.0e-9, not 30e-9.
Pfun <- function(b,f0=1e-9,t_pw=3.0e-9,
a=30.7397582453682,
c=6.60935546184612) {
1-exp((-t_pw)/f0*exp(-a*(1-b/c)^2))
}
Then @Lyzander's uniroot() suggestion works fine:
u1 <- uniroot(function(x) Pfun(x)-0.5,c(6,10))
The estimated value here is 8.05.
par(las=1,bty="l")
curve(Pfun,from=0,to=10,xname="b")
abline(h=0.5,lty=2)
abline(v=u1$root,lty=3)
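Because b enters only through (1-b/c)^2, the curve is symmetric around b = c, so there should be a second solution below c. A sketch of finding both, assuming the same modified Pfun as above (with -t_pw/f0 and t_pw = 3.0e-9); the bracketing intervals are chosen by eye from the plot.

```r
# Pfun as defined above (with the assumed -t_pw/f0 and t_pw = 3.0e-9)
Pfun <- function(b, f0 = 1e-9, t_pw = 3.0e-9,
                 a = 30.7397582453682,
                 c = 6.60935546184612) {
  1 - exp((-t_pw) / f0 * exp(-a * (1 - b / c)^2))
}

# Root below c: Pfun - 0.5 changes sign between b = 0 and b = 6
u_low  <- uniroot(function(x) Pfun(x) - 0.5, c(0, 6))
# Root above c, as in the call above
u_high <- uniroot(function(x) Pfun(x) - 0.5, c(6, 10))

c(u_low$root, u_high$root)
# The two roots are symmetric around c, so their mean should be close to c
mean(c(u_low$root, u_high$root))
```

uniroot() only returns one root per bracketing interval, so each solution needs its own interval with a sign change.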
If you want to solve an equation, the simplest thing to do is to use uniroot, which is in base R.
f0<-1e-9
t_pw<-30e-9
a<-30.7397582453682
c<-6.60935546184612
func <- function(b) {
1-exp((-t_pw)*f0*exp(-a*(1-b/c)^2)) - 0.5
}
#interval is the range of values of b to look for a solution
#it can be -Inf, Inf
> uniroot(func, interval=c(-1000, 1000), extendInt='yes')
Error in uniroot(func, interval = c(-1000, 1000), extendInt = "yes") :
no sign change found in 1000 iterations
As you see above my uniroot call fails. This is because there is no solution to your equation, which is easy to see as well. (-t_pw)*f0 is -3e-17, so exp(-3e-17 * k), with k a positive number between 0 and 1, is practically (very close to) 1 and your equation becomes 1 - 1 - 0.5 = 0, which doesn't hold. You can see the same with a plot as well:
curve(func) #same result for curve(func, from=-1000, to=1000)
In this function the result will be -0.5 for any b.
So one fast way to do it is uniroot, but probably for a different equation.
And a working example:
myfunc2 <- function(x) x - 2
> uniroot(myfunc2, interval=c(0,10))
$root
[1] 2
$f.root
[1] 0
$iter
[1] 1
$init.it
[1] NA
$estim.prec
[1] 8

R minimize absolute error

Here's my setup
obs1<-c(1,1,1)
obs2<-c(0,1,2)
obs3<-c(0,0,3)
absoluteError<-function(obs,x){
return(sum(abs(obs-x)))
}
Example:
> absoluteError(obs2,1)
[1] 2
For a random vector of observations, I'd like to find a minimizer, x, which minimizes the absolute error between the observation values and a vector of all x. For instance, clearly the argument that minimizes absoluteError(obs1,x) is x=1 because this results in an error of 0. How do I find a minimizer for a random vector of observations? I'd imagine this is a linear programming problem, but I've never implemented one in R before.
The median of obs is a minimizer for the absolute error. The following is a sketch of how one might try proving this:
Let the median of a set of n observations, obs, be m. Call the absolute error between obs and m f(obs,m).
Case n is odd:
Consider f(obs,m+delta) where delta is some nonzero number. Suppose delta is positive. Then there are at least (n-1)/2 + 1 observations (those less than or equal to m) whose individual error increases by exactly delta. The error of each of the remaining (at most (n-1)/2) observations decreases by at most delta. So f(obs,m+delta)-f(obs,m)>=delta. (The same argument can be made if delta is negative.) Thus f(obs,m+delta)>f(obs,m) for any nonzero delta, so m is a minimizer for f, and it is the only minimizer in this case.
Case n is even:
Basically the same logic as above, except in this case any number between the two inner most numbers in the set will be a minimizer.
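The claim can be sanity-checked on the three vectors from the question. This is only a brute-force sketch: it compares median() with the best point on a grid, and the grid range and step are arbitrary choices.

```r
absoluteError <- function(obs, x) sum(abs(obs - x))

obs_list <- list(obs1 = c(1, 1, 1), obs2 = c(0, 1, 2), obs3 = c(0, 0, 3))

# Candidate minimizer from the argument above
medians <- sapply(obs_list, median)

# Brute-force check: evaluate the error on a fine grid and take the best x
grid <- seq(-5, 5, by = 0.01)
grid_best <- sapply(obs_list, function(obs) {
  errors <- sapply(grid, function(x) absoluteError(obs, x))
  grid[which.min(errors)]
})

rbind(medians, grid_best)
```

In practice median(obs) alone is enough; the grid search is there only to illustrate the argument.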
I am not sure this answer is correct, and even if it is I am not sure this is what you want. Nevertheless, I am taking a stab at it.
I think you are talking about 'Least absolute deviations', a form of regression that differs from 'Least Squares'.
If so, I found this R code for solving Least absolute deviations regression:
fabs=function(beta0,x,y){
b0=beta0[1]
b1=beta0[2]
n=length(x)
llh=0
for(i in 1:n){
r2=(y[i]-b0-b1*x[i])
llh=llh + abs(r2)
}
llh
}
g=optim(c(1,1),fabs,x=x,y=y)
I found the code here:
http://www.stat.colostate.edu/~meyer/hw12ans.pdf
Assuming you are talking about Least absolute deviations, you might not be interested in the above code if you want a solution in R from scratch rather than a solution that uses optim.
The above code is for a regression line with an intercept and one slope. I modified the code as follows to handle a regression with just an intercept:
y <- c(1,1,1)
x <- 1:length(y)
fabs=function(beta0,x,y){
b0=beta0[1]
b1=0
n=length(x)
llh=0
for(i in 1:n){
r2=(y[i]-b0-b1*x[i])
llh=llh + abs(r2)
}
llh
}
# The commands to get the estimator
g = optim(c(1),fabs,x=x,y=y, method='Brent', lower = (min(y)-5), upper = (max(y)+5))
g
I was not familiar with (i.e., had not heard of) Least absolute deviations until tonight. So, hopefully my modifications are fairly reasonable.
With y <- c(1,1,1) the parameter estimate is 1 (which I think you said is the correct answer):
$par
[1] 1
$value
[1] 1.332268e-15
$counts
function gradient
NA NA
$convergence
[1] 0
$message
NULL
With y <- c(0,1,2) the parameter estimate is 1:
$par
[1] 1
$value
[1] 2
$counts
function gradient
NA NA
$convergence
[1] 0
$message
NULL
With y <- c(0,0,3) the parameter estimate is approximately 0 (8.6e-10, which matches the median):
$par
[1] 8.613159e-10
$value
[1] 3
$counts
function gradient
NA NA
$convergence
[1] 0
$message
NULL
If you want R code from scratch, there is additional R code in the file at the link above which might be helpful.
Alternatively, perhaps it might be possible to extract the relevant code from the source file.
Alternatively, perhaps someone else can provide the desired code (and correct any errors on my part) in the next 24 hours.
If you come up with code from scratch please post it as an answer as I would love to see it myself.
lad=function(x,y){
  # Sum of absolute deviations for intercept beta[1] and slope beta[2]
  SAD = function(beta, x, y) {
    return(sum(abs(y - (beta[1] + beta[2] * x))))
  }
  # Use the least-squares fit as the starting point for the optimizer
  d=lm(y~x)
  ans1 = optim(par=c(d$coefficients[1], d$coefficients[2]),method = "Nelder-Mead",fn=SAD, x=x, y=y)
  # Name the coefficients after the expression passed in as x
  coe=setNames(ans1$par,c("(Intercept)",deparse(substitute(x))))
  fitted=setNames(ans1$par[1]+ans1$par[2]*x,seq_along(x))
  res=setNames(y-fitted,seq_along(x))
  results = list(coefficients=coe, fitted.values=fitted, residuals=res)
  class(results)="lad"
  return(results)
}

more significant digits

How can I get more significant digits in R? Specifically, I have the following example:
> dpois(50, lambda= 5)
[1] 1.967673e-32
However when I get the p-value:
> 1-ppois(50, lambda= 5)
[1] 0
Obviously, the p-value is not 0. In fact it should be greater than 1.967673e-32 since I'm summing a bunch of probabilities. How do I get the extra precision?
Use lower.tail=FALSE:
ppois(50, lambda= 5, lower.tail=FALSE)
## [1] 2.133862e-33
Asking R to compute the upper tail is much more accurate than computing the lower tail and subtracting it from 1: given the inherent limitations of floating point precision, R can't distinguish (1-eps) from 1 for values of eps less than .Machine$double.neg.eps, typically around 10^{-16} (see ?.Machine).
This issue is discussed in ?ppois:
Setting ‘lower.tail = FALSE’ allows to get much more precise
results when the default, ‘lower.tail = TRUE’ would return 1, see
the example below.
Note also that your comment about the value needing to be greater than dpois(50, lambda=5) is not quite right; ppois(x,...,lower.tail=FALSE) gives the probability that the random variable is greater than x, as you can see (for example) by seeing that ppois(0,...,lower.tail=FALSE) is not exactly 1, or:
dpois(50,lambda=5) + ppois(50,lambda=5,lower.tail=FALSE)
## [1] 2.181059e-32
ppois(49,lambda=5,lower.tail=FALSE)
## [1] 2.181059e-32
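To tie the pieces together, here is a small sketch that computes P(X >= 50) three ways: the upper tail, a direct sum of the mass function, and the log-scale upper tail. The cutoff of 200 in the sum is an arbitrary choice; terms beyond it are far too small to matter for lambda = 5.

```r
lambda <- 5

# P(X >= 50) via the upper tail, which is P(X > 49)
p_tail <- ppois(49, lambda = lambda, lower.tail = FALSE)

# The same probability by summing the mass function directly
p_sum <- sum(dpois(50:200, lambda = lambda))

# Log-scale version for cases where even the upper tail would underflow
log_p_tail <- ppois(49, lambda = lambda, lower.tail = FALSE, log.p = TRUE)

c(p_tail = p_tail, p_sum = p_sum, log_p_tail = log_p_tail)
```

The first two values should agree closely, and both should exceed dpois(50, lambda = 5), as the question expected.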
