Distribution empirical in R - r

I have a vector of observations and would like to obtain an empirical p-value for each observation in R. I don't know the underlying distribution, and what I currently do is just
runif(100,0,1000)->ay
quantile(ay)
However, that does not really give me a p-value. How can I obtain a p-value empirically?

I think this is what you're looking for:
rank(ay)/length(ay)

I think what you want is the ecdf function. It returns an empirical cumulative distribution function, which you can apply directly to new values:
ay <- runif(100)
aycdf <- ecdf(ay)
And then
> aycdf(c(.1, .5, .7))
[1] 0.09 0.51 0.73
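For the original question (a p-value for each observation), a minimal sketch is to evaluate the ECDF at the observations themselves. The upper-tail convention shown here is one common choice and is an assumption, not something stated in the answers; the names p_lower and p_upper are illustrative.

```r
set.seed(1)                       # only to make the sketch reproducible
ay <- runif(100, 0, 1000)

# Lower-tail empirical p-value: proportion of observations <= each value
p_lower <- ecdf(ay)(ay)

# With no ties this matches the rank-based answer
all.equal(p_lower, rank(ay) / length(ay))

# Upper-tail version: proportion of observations >= each value
p_upper <- sapply(ay, function(v) mean(ay >= v))
```

With tied values, rank() averages ranks by default, so the two approaches can differ slightly unless ties.method = "max" is used.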

Related

cv.glm cutoff value of 0.75 in r

I am doing some analysis regarding a binomial glm model that I have fitted earlier in R. While looking at my data, I figured out that the suitable cutoff point for my binary outcome should be 0.75 instead of 0.5. I am trying to get the cost() function of the cv.glm() {boot package} to use the 0.75 cutoff point, but I have failed to get the right syntax.
I know for the 0.5 cutoff we normally use:
cost <- function(r, pi = 0) mean(abs(r-pi) > 0.5)
Can someone show me what is the right way to change the cutoff point in this function? (let's stick to 0.75 maybe).
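One possible way to express a 0.75 cutoff, as a sketch only: classify an observation as 1 when its predicted probability exceeds the cutoff, then return the misclassification rate. The name cost_cutoff, the cutoff argument, and the convention that a probability exactly at the cutoff counts as 0 are all assumptions; r is the observed 0/1 response and pi the predicted probability, as in the 0.5 version.

```r
# Misclassification rate for a chosen probability cutoff (hypothetical helper)
cost_cutoff <- function(r, pi = 0, cutoff = 0.75) {
  predicted <- as.numeric(pi > cutoff)   # 1 if predicted probability exceeds cutoff
  mean(predicted != r)                   # share of observations misclassified
}

# Small check: predictions are 1, 0, 1, 0 against observed 1, 1, 0, 0
cost_cutoff(c(1, 1, 0, 0), c(0.8, 0.6, 0.9, 0.1))
```

Since cv.glm() is documented to call cost with two vector arguments (observed and predicted responses), the default cutoff = 0.75 would apply when this is passed as cost = cost_cutoff.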

Convolution of densities

If x is the data and hist$density gives me the empirical densities, how do I obtain the convolution of x with itself? convolve() gives me a lot of possibilities, but I am not sure which to use. I need a function for this: http://en.wikipedia.org/wiki/Convolution#Discrete_convolution (actually just the real case would suffice)
Simple test: the convolution of two Bernoullis gives me a Binomial
Pr=c(0.7, 0.3)
The right answer should be a Binomial with parameters n=2, p=0.3
OK, the right answer is:
> convolve(Pr,rev(Pr),type="o")
[1] 0.49 0.42 0.09
a Binomial with parameters n=2 and p=0.3. So to convolve densities one could use:
convolve(his$density, rev(his$density), type="o")
This works well with discrete distributions but may work (very) badly for continuous distributions.
Note: if you want the cumulative distribution function, use cumsum() on the result of the convolution
Suggestion: for continuous distributions table(x)/sum(table(x)) gives a more accurate input for convolution
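As a quick check of the Bernoulli example, the convolution can be compared with R's built-in binomial functions. This is a small sketch; the names pmf_sum and cdf_sum are illustrative.

```r
Pr <- c(0.7, 0.3)                              # P(X = 0), P(X = 1)

# Open convolution of the pmf with itself: distribution of X1 + X2
pmf_sum <- convolve(Pr, rev(Pr), type = "o")

# Should agree with Binomial(n = 2, p = 0.3) up to floating-point error
all.equal(pmf_sum, dbinom(0:2, size = 2, prob = 0.3))

# Cumulative sums give the distribution function of the sum
cdf_sum <- cumsum(pmf_sum)
all.equal(cdf_sum, pbinom(0:2, size = 2, prob = 0.3))
```

convolve() works through the FFT, so the comparison uses all.equal() rather than exact equality.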

Find quantiles of gamma like distribution

The data are approximately gamma distributed.
To replicate the data, I would do something like this:
a) First find the distribution parameters of the true data:
fitdist(datag, "gamma", optim.method="Nelder-Mead")
b) Use the fitted shape and rate to simulate data (scale is just 1/rate, so only one of the two should be passed):
data <- rgamma(10000, shape=0.6, rate=4.8)
To find quantiles using the qgamma function in R, it would be just:
qgamma(seq(1, 0.1, by=-0.1), shape=0.6, rate=4.8)
How can I find quantiles for my true data (not simulated with rgamma)?
Please note that R's quantile function returns the desired quantiles of the true data (datag), but as I understand it these assume the data are normally distributed. As you can see, they clearly are not.
quantile(datag, seq(0,1, by=0.1), type=7)
Which function in R should I use, or otherwise how can I obtain the quantiles of such highly skewed data?
In addition, would this make some sense? I am still not getting the lower values!
Fn <- ecdf(datag)
Fn(seq(0.1,1,by=0.1))
Quantiles are returned by the "q" functions, in this case qgamma. For your data the eyeball integration suggests that most of the data is to the left of 0.2 and if we ask for the 0.8 quantile we see that 80% of the data in the estimated distribution is to the left of:
qgamma(.8, shape=0.6, rate=4.8)
#[1] 0.20604
Seems to agree with what you have plotted. If you wanted the 0.8 quantile in the sample you have, then just:
quantile(datag, 0.8)
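To see the two kinds of quantile side by side, here is a minimal sketch. Because the real datag is not available, it is replaced by a simulated stand-in using the shape and rate quoted in the question; that substitution is an assumption made only so the code runs.

```r
set.seed(1)
datag <- rgamma(10000, shape = 0.6, rate = 4.8)   # stand-in for the real data

probs <- seq(0.1, 0.9, by = 0.1)

# Sample quantiles: computed from the data, no distribution assumed
empirical <- quantile(datag, probs = probs, names = FALSE)

# Quantiles of the fitted gamma distribution
fitted <- qgamma(probs, shape = 0.6, rate = 4.8)

cbind(probs, empirical, fitted)
```

If the gamma fit is adequate, the two columns should be close; large gaps in the tails would suggest the fitted distribution does not describe the data well there.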

R: Function that finds the range of 95% of all values?

Is there a function or an elegant way in the R language, to get the minimum range, that covers, say 95% of all values in a vector?
Any suggestions are very welcome :)
95% of the data will fall between the 2.5th percentile and the 97.5th percentile. This is the central 95% interval, which is not necessarily the shortest one. You can compute those values in R as follows:
x <- runif(100)
quantile(x,probs=c(.025,.975))
To get a sense of what's going on, here's a plot:
qts <- quantile(x,probs=c(.025,.975))
hist(x)
abline(v=qts[1],col="red")
abline(v=qts[2],col="red")
Note this is the exact/empirical 95% interval; there's no normality assumption.
It's not so hard to write such a function:
find_cover_region <- function(x, alpha=0.95) {
  n <- length(x)
  x <- sort(x)
  # number of points allowed to fall outside the interval
  k <- as.integer(round((1-alpha) * n))
  # widths of all windows containing n-k consecutive sorted points
  i <- which.min(x[seq.int(n-k, n)] - x[seq_len(k+1L)])
  c(x[i], x[n-k+i-1L])
}
The function finds the shortest interval. If several intervals have the same length, the first one (starting from -Inf) is picked.
find_cover_region(1:100, 0.70)
# [1] 1 70
find_cover_region(rnorm(10000), 0.9973) # three sigma, approx (-3,3)
# [1] -2.859 3.160 # results may differ
You could also look at highest density regions (e.g. the function hdr in the package hdrcde). This is a more statistical way to find the shortest intervals with a given coverage probability (some kernel density estimators are involved).
The emp.hpd function in the TeachingDemos package will find the values in a vector that enclose a given percentage of the data (95%) that also give the shortest range between the values. If the data is roughly symmetric then this will be close to the results of using quantile, but if the data are skewed then this will give a shorter range.
If the values are distributed approximately like the normal distribution, you can use the standard deviation. First, calculate the mean µ and standard deviation of the distribution. 95% of the values will be in the interval of (µ - 1.960 * stdev, µ + 1.960 * stdev).
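The three base-R approaches from this thread can be compared on skewed data with a short sketch. The exponential sample is an assumed stand-in, and find_cover_region is repeated from the answer above only so the block runs on its own.

```r
set.seed(1)
x <- rexp(10000)                       # right-skewed stand-in data

# Central interval from the empirical quantiles
central <- quantile(x, probs = c(0.025, 0.975), names = FALSE)

# Shortest interval covering 95% of the points
find_cover_region <- function(x, alpha = 0.95) {
  n <- length(x)
  x <- sort(x)
  k <- as.integer(round((1 - alpha) * n))
  i <- which.min(x[seq.int(n - k, n)] - x[seq_len(k + 1L)])
  c(x[i], x[n - k + i - 1L])
}
shortest <- find_cover_region(x)

# Normal approximation: mean +/- 1.96 standard deviations
normal_approx <- mean(x) + c(-1, 1) * qnorm(0.975) * sd(x)

rbind(central, shortest, normal_approx)
```

On data like these the normal approximation extends below zero even though no observation is negative, which is why it is only suitable for roughly normal data.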

R: empirical version of pnorm() and qnorm()?

I have a normalization method that uses the normal distribution functions pnorm() and qnorm(). I want to alter my logic so that I can use empirical distributions instead of assuming normality. I've used ecdf() to calculate the empirical cumulative distributions but then realized I was beginning to write a function that basically was the p and q versions of the empirical. Is there a simpler way to do this? Maybe a package with pecdf() and qecdf()? I hate reinventing the wheel.
You can use the quantile and ecdf functions to get qecdf and pecdf, respectively:
x <- rnorm(20)
quantile(x, 0.3, type=1) #30th percentile
Fx <- ecdf(x)
Fx(0.1) # cdf at 0.1
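If function-style names are preferred, the two calls can be wrapped. This is a sketch: pecdf and qecdf are the hypothetical names from the question, not functions from base R or any package known to this thread.

```r
set.seed(1)
x <- rnorm(20)

# "p" function: the empirical CDF itself
pecdf <- ecdf(x)

# "q" function: type = 1 is the inverse of the empirical CDF,
# so it always returns one of the observed values
qecdf <- function(p) quantile(x, probs = p, type = 1, names = FALSE)

p <- c(0.03, 0.33, 0.77)
q <- qecdf(p)
pecdf(q)          # each value is at least the corresponding p
```

Because the empirical CDF is a step function, pecdf(qecdf(p)) is generally the next step at or above p rather than p itself.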
'emulating' pnorm for an empirical distribution with ecdf:
> set.seed(42)
> x <- ecdf(rnorm(1000))
> x(0)
[1] 0.515
> pnorm(0)
[1] 0.5
Isn't that exactly what bootstrap p-values do?
If so, keep a vector, sort it, and read out the value at the appropriate position (e.g. position 500 for 5% with 10k repetitions). There are some subtle issues with which position to pick, as help(quantile) discusses under 'Types'.
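A minimal sketch of reading a sorted vector by position, and how it relates to quantile(). The normal draws are an assumed stand-in for real bootstrap replicates.

```r
set.seed(123)
reps <- rnorm(10000)              # stand-in for 10k bootstrap replicates

# Sort and read out the 500th value for the 5% point
sorted <- sort(reps)
by_position <- sorted[500]

# The default type = 7 interpolates between neighbouring order statistics
q7 <- quantile(reps, probs = 0.05, type = 7, names = FALSE)

c(by_position = by_position, type7 = q7, next_value = sorted[501])
```

The default quantile lands between the 500th and 501st sorted values, which is the kind of positional subtlety the 'Types' section of help(quantile) describes.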
