I have a normalization method that uses the normal distribution functions pnorm() and qnorm(). I want to alter my logic so that I can use empirical distributions instead of assuming normality. I've used ecdf() to calculate the empirical cumulative distributions but then realized I was beginning to write a function that basically was the p and q versions of the empirical. Is there a simpler way to do this? Maybe a package with pecdf() and qecdf()? I hate reinventing the wheel.
You can use the quantile and ecdf functions to get qecdf and pecdf, respectively:
x <- rnorm(20)
quantile(x, 0.3, type=1) #30th percentile
Fx <- ecdf(x)
Fx(0.1) # cdf at 0.1
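If you want functions with the exact names from the question, a minimal sketch (pecdf and qecdf here are just thin wrappers over ecdf and quantile):
# pecdf: empirical CDF, P(X <= q); qecdf: empirical quantile (inverse ECDF)
pecdf <- ecdf(x)
qecdf <- function(p) quantile(x, p, type = 1)
pecdf(0.1)   # same as Fx(0.1) above
qecdf(0.3)   # same as quantile(x, 0.3, type = 1) above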
'emulating' pnorm for an empirical distribution with ecdf:
> set.seed(42)
> x <- ecdf(rnorm(1000))
> x(0)
[1] 0.515
> pnorm(0)
[1] 0.5
Isn't that exactly what bootstrap p-values do?
If so, keep a vector, sort it, and read out the value at the appropriate position (e.g. position 500 for 5% on 10,000 repetitions). There are some subtle issues with which positions to pick, as e.g. help(quantile) discusses under 'Types'.
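A sketch of that idea (the 10,000 replicates here are simulated stand-ins, not real bootstrap output):
set.seed(1)
boot_stats <- rnorm(10000)            # stand-in for 10,000 bootstrap replicates
sorted <- sort(boot_stats)
sorted[500]                           # roughly the 5% point (position 500 of 10,000)
quantile(boot_stats, 0.05, type = 1)  # the same idea via quantile(); see ?quantile, 'Types'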
I want to analyse a logarithmic growth curve in more detail. In particular, I would like to know the time point when the slope becomes >0 (which is the starting point of growth after a lag phase).
Therefore I fitted a logarithmic function to my growth data with the grofit package of R. I got values for the three parameters (lambda, mu, maximal asymptote).
Now I thought I could take the first derivative of the logarithmic growth function, set mu=0 (the slope at any time point during growth), and solve the equation for the time (x). I'm not sure this is possible, since mu=0 holds over a longer timespan at the beginning of the curve (so there is no unique time point). But maybe I could approximate that point by setting mu=0.01, which should be more specific.
Anyway I used the Deriv package to find the first derivative of my logarithmic function:
Deriv(a/(1+exp(((4*b)/a)*(c-x)+2)), "x")
where a = asymptote, b = maximal slope, c = lambda.
As a result I got:
{.e2 <- exp(2 + 4 * (b * (c - x)/a))
4 * (.e2 * b/(.e2 + 1)^2)}
Or in normal writing:
f'(x)=(4*exp(2+((4b(c-x))/a))*b)/((exp(2+((4b(c-x))/a))+1)^2)
Now I would like to solve this function for x with f'(x)=0.01. Can anyone tell me how best to do it?
Also, do you have comments on my way of thinking or the R functions I used?
Thank you.
Anne
Using a root solving function is more appropriate than using an optimization function.
I'll give an example with two packages.
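The examples below assume fn is the derivative expression minus the target value 0.01 (without an abs() wrapper, so that the function changes sign), with illustrative parameter values a = b = c = 1:
# f'(x) - 0.01 with placeholder parameter values a = b = c = 1
a <- 1; b <- 1; c <- 1
fn <- function(x) {
  e2 <- exp(2 + 4 * b * (c - x) / a)
  4 * e2 * b / (e2 + 1)^2 - 0.01
}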
It would also be a good idea to plot the function for a range of values.
Like this:
curve(fn,-.1,.1)
You can see that using the base R function uniroot will present problems since it needs function values at the endpoints of the interval to be of opposite sign.
Using package nleqslv like this
library(nleqslv)
nleqslv(1,fn)
gives
$x
[1] 0.003388598
$fvec
[1] 8.293101e-10
$termcd
[1] 1
$message
[1] "Function criterion near zero"
<more info> ......
Using function fsolve from package pracma
library(pracma)
fsolve(fn,1)
gives
$x
[1] 0.003388585
$fval
[1] 3.136539e-10
The solutions given by both packages are very close to each other.
This might not be the best approach, but you can use the optim function to find the solution. Check the code below; I am basically trying to find the value of x which minimizes abs(f(x) - 0.01).
The starting seed value for x may be important; the optim function might not converge for some seeds.
# fn returns |f'(x) - 0.01|; a, b, c are fixed at illustrative values
fn <- function(x){
  a <- 1
  b <- 1
  c <- 1
  return( abs((4*exp(2+((4*b*(c-x))/a))*b) / ((exp(2+((4*b*(c-x))/a))+1)^2) - 0.01) )
}
x <- optim(10,fn)
x$par
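A hedged alternative for this one-dimensional problem is optim's Brent method, which needs finite bounds and avoids the Nelder-Mead warning for 1-D problems (the bounds below are an assumption chosen to bracket only the first of the two solutions):
# Brent requires finite lower/upper; [-5, 1] brackets only the solution near x = 0.0034
optim(0, fn, method = "Brent", lower = -5, upper = 1)$par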
Thank you very much for your efforts. Unfortunately, none of the above solutions worked for me :-(
I figured the problem out the old-fashioned way (pencil + paper + mathematics book).
Have a good day
Anne
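For reference, one way the pencil-and-paper route can go (a sketch, not necessarily Anne's derivation): setting 4*b*E/(E+1)^2 = m with E = exp(2 + 4*b*(c-x)/a) gives the quadratic E^2 + (2 - 4*b/m)*E + 1 = 0, and then x = c - a*(log(E) - 2)/(4*b).
# Closed-form solution for f'(x) = m; same placeholder parameter values as above
a <- 1; b <- 1; c <- 1; m <- 0.01
E <- ((4 * b / m - 2) + sqrt((2 - 4 * b / m)^2 - 4)) / 2  # larger root gives the early time point
c - a * (log(E) - 2) / (4 * b)
# approximately 0.0033886, in agreement with the numerical solutions above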
I have a random variable X and a transformation f and I would like to know the probability distribution function of f(X), at least approximately. In Mathematica there is TransformedDistribution, but I could not find something similar in R. As I said, some kind of approximative solution would be fine, too.
You can check the distr package. For instance, say that y = x^2+2x+1, where x is normally distributed with mean 2 and standard deviation 5. You can:
require(distr)
x<-Norm(2,5)
y<-x^2+2*x+1
# y@r gives random samples. We make a histogram.
hist(y@r(10000))
# y@d and y@p are the density and the cumulative distribution functions
y@d(80)
# [1] 0.002452403
y@p(80)
# [1] 0.8891796
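If you also need the inverse CDF of the transformed variable, the q slot should follow the same convention (an addition on my part, assuming the distr slot names carry over to the derived distribution):
# y@q is the quantile (inverse CDF) function of the transformed distribution
y@q(0.975)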
I have a vector of observations and would like to obtain an empirical p-value for each observation with R. I don't know the underlying distribution, and what I currently do is just
ay <- runif(100, 0, 1000)
quantile(ay)
However, that does not really give me a p-value. How can I obtain a p-value empirically?
I think this is what you're looking for:
rank(ay)/length(ay)
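To see what that gives (and, if an upper-tail p-value is what's needed, its complement), a small hedged illustration:
# ay as defined in the question; rank/length is each observation's empirical CDF value
head(rank(ay) / length(ay))        # empirical P(X <= observation)
head(1 - rank(ay) / length(ay))    # upper-tail empirical p-value, if that is what's wanted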
I think what you want is the ecdf function. This returns an empirical cumulative distribution function, which you can apply directly
ay <- runif(100)
aycdf <- ecdf(ay)
And then
> aycdf(c(.1, .5, .7))
[1] 0.09 0.51 0.73
Is there any function in R which will calculate the inverse kernel (I am considering normal) CDF for a particular alpha in (0,1)?
I have found quantile but I am not sure how it works.
Thanks
We can integrate to get the cdf and we can use a root finding algorithm to invert the cdf. First we'll want to interpolate the output from density.
set.seed(10000)
x <- rnorm(1000, 10, 13)
pdf <- density(x)
# Interpolate the density
f <- approxfun(pdf$x, pdf$y, yleft=0, yright=0)
# Get the cdf by numeric integration
cdf <- function(x){
  integrate(f, -Inf, x)$value
}
# Use a root finding function to invert the cdf
invcdf <- function(q){
  uniroot(function(x){cdf(x) - q}, range(x))$root
}
which gives
med <- invcdf(.5)
cdf(med)
#[1] 0.5000007
This could definitely be improved upon. One issue is that I don't guarantee that the cdf is always less than or equal to 1 (if you check the cdf for values larger than max(x) you might get something like 1.00097). But I'm too tired to fix that now. This should give a decent start.
An alternative approach would be to use log-spline density estimation rather than kernel density estimation. Look at the 'logspline' package. With logspline density estimations you get CDF (plogspline) and inverse CDF (qlogspline) functions.
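A minimal sketch of that approach (reusing the x simulated in the previous answer; function names are from the logspline package):
library(logspline)
fit <- logspline(x)       # log-spline density estimate of x
plogspline(10, fit)       # CDF at 10
qlogspline(0.5, fit)      # inverse CDF (median)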
Is there a function or an elegant way in the R language to get the minimum range that covers, say, 95% of all values in a vector?
Any suggestions are very welcome :)
95% of the data will fall between the 2.5th percentile and 97.5th percentile. You can compute that value in R as follows:
x <- runif(100)
quantile(x,probs=c(.025,.975))
To get a sense of what's going on, here's a plot:
qts <- quantile(x,probs=c(.05,.95))
hist(x)
abline(v=qts[1],col="red")
abline(v=qts[2],col="red")
Note this is the exact/empirical 95% interval; there's no normality assumption.
It's not so hard to write such a function:
find_cover_region <- function(x, alpha = 0.95) {
  n <- length(x)
  x <- sort(x)
  k <- as.integer(round((1 - alpha) * n))
  i <- which.min(x[seq.int(n - k, n)] - x[seq_len(k + 1L)])
  c(x[i], x[n - k + i - 1L])
}
The function finds the shortest such interval. If there are several intervals of the same length, the first one (from -Inf) is picked.
find_cover_region(1:100, 0.70)
# [1] 1 70
find_cover_region(rnorm(10000), 0.9973) # three sigma, approx (-3,3)
# [1] -2.859 3.160 # results may differ
You could also look at highest density regions (e.g. in package hdrcde, function hdr). It's a more statistical way to find the shortest intervals with a given coverage probability (some kernel density estimators are involved).
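A small sketch of the hdr route (the prob argument is in percent):
library(hdrcde)
x <- rnorm(10000)
hdr(x, prob = 95)$hdr     # endpoints of the 95% highest density region(s)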
The emp.hpd function in the TeachingDemos package will find the values in a vector that enclose a given percentage of the data (95%) while also giving the shortest range between the values. If the data are roughly symmetric then this will be close to the results of using quantile, but if the data are skewed then this will give a shorter range.
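A hedged usage sketch (the conf argument name is my assumption from the package's conventions):
library(TeachingDemos)
x <- rexp(10000)          # skewed data, where the shortest interval differs from the quantile-based one
emp.hpd(x, conf = 0.95)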
If the values are distributed approximately like the normal distribution, you can use the standard deviation. First, calculate the mean µ and standard deviation stdev of the distribution. 95% of the values will be in the interval (µ - 1.960 * stdev, µ + 1.960 * stdev).
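For example (a minimal sketch under the normality assumption):
x <- rnorm(1000, mean = 50, sd = 10)
mean(x) + c(-1, 1) * 1.960 * sd(x)   # approximately the central 95% under normality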