Finding the intersection of two curves in a scatterplot (here: pvalues vs test-statistics) - r

I run the following:
library(Hmisc)
df <- as.matrix(replicate(20, rnorm(20)))
cor.df <- rcorr(df)
plot(cor.df$r,cor.df$P)
abline(h=0.05)
and I would like to know if R can compute the meeting point of the horizontal line and the bell-shaped curve. Since I have a scatterplot, do I need to model the x,y-curve first and then equate the two functions? Or can R do that graphically?
What I actually want to know is the threshold at which (uncorrected) p-values indicate a significant test statistic for a given dataset. I am not a trained statistician, so excuse me if this is a basic question.
Thank you very much!

There is no function to graphically calculate an intersection. There are functions like uniroot that you can use in R to find intersections, but you need to have proper functions and have a good idea of the interval where the intersection occurs.
It would be best to properly model the curve in question, but a simple way to approximate a function when you have a bunch of points on the curve is to use linear interpolation between the observed points. You can create such a function from your points with approxfun
f1 <- approxfun(cor.df$r,cor.df$P, rule=2)
(again, a proper model would be better, but just for the sake of example, I'll continue with this function).
Now we can find the places where this curve crosses 0.05 with
uniroot(function(x) f1(x)-.05, c(-1,-.001))$root
# [1] -0.4437796
uniroot(function(x) f1(x)-.05, c(.001, 1))$root
# [1] 0.4440005
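As a sketch of what a "proper model" could look like here: assuming the p-values are the usual two-sided p-values of a Pearson correlation, computed from the t distribution with n - 2 degrees of freedom (with n = 20 rows in the example matrix), the curve can be written down directly and passed to uniroot. The helper name p_from_r is made up for this illustration.

```r
# Assumption: two-sided p-value of a Pearson correlation r from n observations,
# based on the t distribution with n - 2 degrees of freedom.
p_from_r <- function(r, n) {
  t_stat <- r * sqrt((n - 2) / (1 - r^2))
  2 * pt(-abs(t_stat), df = n - 2)
}

n <- 20  # number of observations per variable in the example

# Positive crossing of the p = 0.05 line; the negative one is its mirror image.
r_crit <- uniroot(function(r) p_from_r(r, n) - 0.05, c(0.001, 0.999))$root

# Closed-form check obtained by inverting the t statistic
t_crit   <- qt(0.975, df = n - 2)
r_closed <- t_crit / sqrt(t_crit^2 + n - 2)

c(uniroot = r_crit, closed_form = r_closed)
```

Under that assumption both values come out at about 0.444, in line with the interpolated roots above, and the threshold depends only on n rather than on the particular random draws.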

Related

Computing ECDF of a data for parameter estimation using weighted nonlinear least square in R

I am writing code to estimate the parameters of a GPD using the weighted nonlinear least squares (WNLS) method.
The WNLS method consists of 2 steps.
Step 1: $(\hat{\xi}_1, \hat{b}_1) = \arg\min_{(\xi,b)} \sum_{i=1}^{n} [\log(1-F_n(x_i)) - \log(1-G_{\xi,b}(x_i))]$,
where $F_n$ is the ECDF and $G_{\xi,b}$ is the CDF of the generalized Pareto distribution.
Can anyone tell me how to calculate the ECDF $F_n$ for data "X" in R?
Will ecdf(X)(X) calculate the ECDF? If so, what is the need for ecdf(X) other than plotting? It would also be really helpful if someone could share example code that computes the ECDF for some data.
The ecdf call creates a function. That is, you can apply ecdf(X) to other data, as your ecdf(X)(X) call does. However, you might want to apply ecdf(X) to something other than X itself. If you want to know the empirical quantile to which three numbers a, b, and c_ correspond, an easy way to do that is to call ecdf(X)(c(a, b, c_)).
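A minimal sketch of both uses, with a made-up exponential sample standing in for X and arbitrary evaluation points:

```r
set.seed(1)
X <- rexp(50)          # hypothetical sample standing in for the real data

Fn <- ecdf(X)          # Fn is a function, not a vector

# ECDF evaluated at the observations themselves: with no ties this is rank / n
Fn_at_data <- Fn(X)
all.equal(Fn_at_data, rank(X) / length(X))

# ECDF evaluated at arbitrary other points
Fn(c(0.5, 1, 2))
```

One thing to watch for in the WNLS criterion: Fn equals 1 at the largest observation, so log(1 - Fn(x)) is -Inf there and that point needs some adjustment or exclusion. How to handle it is not covered above and depends on the method's definition.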

How to average graph functions and find the confidence band in R

I'm using the 'spatstat' package in R and obtained a set of Ripley's K functions (or L functions). I want to find a good way to average out this set of graphs on a single average line, as well as graphing out the standard deviation or confidence interval around this average line.
So far I've tried:
env.A <- envelope(A, fun=Lest, correction=c("Ripley"), nsim=99, rank=1, global=TRUE)
Aa <- env.A
avg <- eval.fv((Aa+Bb+Cc+Dd+Ee+Ff+Gg+Hh+Ii+Jj+Kk+Ll+Mm+Nn+Oo+Pp+Qq+Rr+Ss+Tt+Uu+Vv+Ww+Xx)/24)
plot(avg, xlim=c(0,200), . - r ~ r, ylab='', legend='')
With this, I got the average line from the data set.
However, I'm now stuck on finding the confidence interval around this average line.
Does anyone know a good way to do this?
The help file for envelope explains how to do this.
E <- envelope(A, Lest, correction="Ripley", nsim=100, VARIANCE=TRUE)
plot(E, . - r ~ r)
See help(envelope) for more explanation.
In this example, the average or middle curve is computed using a theoretical formula, because the simulations are generated from Complete Spatial Randomness, and the theoretical value of the L function is known. If you want the middle curve to be determined by the sample averages instead, set use.theory = FALSE in the call to envelope.
Can I also point out that the bands you get from envelope are not confidence intervals. A confidence interval would be centred around the estimated L function for the data point pattern A. The bands you get from the envelope command are centred around the mean value of the simulated curves. They are significance bands and their interpretation is related to a statistical significance test. This is also explained in the help file.

Plotting a difference between two ecdf()

I have two sets of 100.000 observations that come from a simulation.
Since one of the two cases is a 'baseline' case and the other is a 'treatment' case, I want to create a plot that highlights the difference in distribution between the two simulations.
I started with an ecdf() of the two populations. The result is in the picture.
What I would like to do is to have a plot of the difference between the two ecdf curves.
A simple ecdf(baseline) - ecdf(treatment) does not work since ecdf returns a function; even using Ecdf from the Hmisc package does not work, since Ecdf returns a list and the difference operator '-' is again not defined in that case.
By running this code you can get to the scenario described by the picture above
a <- runif(10000)
b <- rnorm(10000,0.5,0.5)
plot(ecdf(a))
lines(ecdf(b), col='red')
Any hints would be more than welcome.
So evaluate the functions?
decdf <- function(x, baseline, treatment) ecdf(baseline)(x) - ecdf(treatment)(x)
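To turn that into a plot, the difference can be evaluated on a grid covering both samples. A small sketch using the simulated a and b from the question (the grid size of 500 is an arbitrary choice):

```r
set.seed(42)
a <- runif(10000)              # baseline
b <- rnorm(10000, 0.5, 0.5)    # treatment

decdf <- function(x, baseline, treatment) ecdf(baseline)(x) - ecdf(treatment)(x)

# Evaluate the difference on a common grid spanning both samples
grid <- seq(min(a, b), max(a, b), length.out = 500)
d <- decdf(grid, a, b)

plot(grid, d, type = "l", xlab = "x", ylab = "ecdf(a) - ecdf(b)")
abline(h = 0, lty = 2)         # reference line: no difference
```

Note that decdf rebuilds both ECDFs on every call; for repeated use it may be cheaper to store ecdf(a) and ecdf(b) once and subtract their values.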

Simulate Values under custom density

I have a theoretical and coding question that has to do with densities and simulating values.
I am building custom densities via the density(x) command, and I am hoping to generate 1000-10000 simulated values from such a density. The overall goal is to take two densities built with density(x$y), run simulations, and say that density A is higher than density B x% of the time. I would take each pair of simulated values, see which is higher, and count how many times A is higher than B.
Is there a way to accomplish this? Or is there some way to accomplish something similar with these densities? Thanks!
The sample function can take the midpoints of the intervals of the sample density and then use the densities as the prob-arguments.
mysamp <- sample(x = dens$x, size = 1000, prob = dens$y, replace = TRUE)  # dens is the object returned by density()
This has the disadvantage that you may need to jitter the result to avoid lots of duplicates.
mysamp <- jitter(mysamp)
Another method is to use approxfun and ecdf. You may need to invert the function (reverse the roles of x and y) in order to sample by feeding runif(1000) into the result. I'm pretty sure there are worked examples of this on SO, and I'm pretty sure that I am one of many who have posted such code to R-help in the past. (If your searches have failed to find them, then post the search strategies and others can try to improve upon them.)
Following @DWin's tip to invert the ecdf, here is how to implement such an approach, using a spline to fit the inverted step-function:
Given
z <- c(rnorm(40), runif(40))
plot(density(z))
Define
spl <- with(environment(ecdf(z)), splinefun(y, x))
sampler <- function(n) spl(runif(n))
Now you can call sampler() with the size you want:
plot(density(sampler(1000)))
Final note: This will generate values within, or only slightly outside, the range of the original data (the spline can extrapolate for uniform draws below the smallest ECDF value), and duplicates will be extremely rare:
> anyDuplicated(sampler(1e4))
[1] 0
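Returning to the original goal of comparing two densities, here is a rough sketch built on the sample() approach above. The data A and B are made up, the helper draw is hypothetical, and the uniform spreading within one grid cell is used here in place of jitter() as an assumption of this sketch.

```r
set.seed(123)
A <- rnorm(200, mean = 1)      # hypothetical data behind density A
B <- rnorm(200, mean = 0)      # hypothetical data behind density B
densA <- density(A)
densB <- density(B)

# Draw n values from a density() object: pick grid points with probability
# proportional to the estimated density, then spread each draw uniformly
# within one grid cell to avoid exact duplicates.
draw <- function(dens, n) {
  step <- diff(dens$x[1:2])    # density() returns an equally spaced grid
  s <- sample(dens$x, size = n, replace = TRUE, prob = dens$y)
  s + runif(n, -0.5, 0.5) * step
}

sampA <- draw(densA, 10000)
sampB <- draw(densB, 10000)

# Share of simulated pairs in which A is higher than B
p_A_gt_B <- mean(sampA > sampB)
p_A_gt_B
```

The result is a Monte Carlo estimate, so it varies a little from run to run unless the seed is fixed.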

analytical derivative of splinefun()

I'm trying to fit a natural cubic spline to probabilistic data (probabilities that a random variable is smaller than certain values) to obtain a cumulative distribution function, which works well enough using splinefun():
cutoffs <- c(-90,-60,-30,0,30,60,90,120)
probs <- c(0,0,0.05,0.25,0.5,0.75,0.9,1)
CDF.spline <- splinefun(cutoffs,probs, method="natural")
plot(cutoffs,probs)
curve(CDF.spline(x), add=TRUE, col=2, n=1001)
I would then, however, like to use the density function, i.e. the derivative of the spline, to perform various calculations (e.g. to obtain the expected value of the random variable).
Is there any way of obtaining this derivative as a function rather than just evaluated at a discrete number of points via CDF.spline(x, deriv=1)?
This is pretty close to what I'm looking for, but alas the example doesn't seem to work in R version 2.15.0.
Barring an analytical solution, what's the cleanest numerical way of going about this?
If you change the environment assignment line for g in the code that Berwin Turlach provided on R-help to this:
environment(g) <- environment(f)
... you succeed in R 2.15.1.
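As a simpler numerical fallback (not the analytical approach from the R-help code), the deriv argument of the function returned by splinefun can be wrapped so that the density becomes an ordinary function of x, which integrate() then accepts. A sketch using the data from the question; the name PDF.spline is made up here:

```r
cutoffs <- c(-90, -60, -30, 0, 30, 60, 90, 120)
probs   <- c(0, 0, 0.05, 0.25, 0.5, 0.75, 0.9, 1)
CDF.spline <- splinefun(cutoffs, probs, method = "natural")

# Wrap the deriv argument so the density is a plain function of x
PDF.spline <- function(x) CDF.spline(x, deriv = 1)

# Sanity check: the density should integrate to CDF(120) - CDF(-90) = 1
total_mass <- integrate(PDF.spline, lower = -90, upper = 120)$value

# Expected value of the random variable over the fitted range
expected <- integrate(function(x) x * PDF.spline(x),
                      lower = -90, upper = 120)$value

c(total_mass = total_mass, expected = expected)
```

A natural spline is not guaranteed to be monotone, so the resulting density may dip slightly below zero in places; it is worth plotting PDF.spline before relying on it.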
