How to average graph functions and find the confidence band in R

I'm using the 'spatstat' package in R and have obtained a set of Ripley's K functions (or L functions). I want to find a good way to average this set of curves into a single mean curve, and to plot the standard deviation or confidence interval around that mean curve.
So far I've tried:
env.A <- envelope(A, fun=Lest, correction="Ripley", nsim=99, nrank=1, global=TRUE)
Aa <- env.A
avg <- eval.fv((Aa+Bb+Cc+Dd+Ee+Ff+Gg+Hh+Ii+Jj+Kk+Ll+Mm+Nn+Oo+Pp+Qq+Rr+Ss+Tt+Uu+Vv+Ww+Xx)/24)
plot(avg, xlim=c(0,200), . - r ~ r, ylab='', legend='')
With this, I got the average line from the data set.
However, I'm now stuck on finding the confidence interval around this average line.
Does anyone know a good way to do this?

The help file for envelope explains how to do this.
E <- envelope(A, Lest, correction="Ripley", nsim=100, VARIANCE=TRUE)
plot(E, . - r ~ r)
See help(envelope) for more explanation.
In this example, the average or middle curve is computed using a theoretical formula, because the simulations are generated from Complete Spatial Randomness, and the theoretical value of the L function is known. If you want the middle curve to be determined by the sample averages instead, set use.theory = FALSE in the call to envelope.
Can I also point out that the bands you get from envelope are not confidence intervals. A confidence interval would be centred around the estimated L function for the data point pattern A. The bands you get from the envelope command are centred around the mean value of the simulated curves. They are significance bands and their interpretation is related to a statistical significance test. This is also explained in the help file.
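One way to get a band around the averaged curve, sketched here in base R under the assumption that all 24 curves are evaluated on the same r grid and can be treated as independent replicates. The curves matrix below is simulated stand-in data, not real L functions, and the names curves, avg, lower and upper are made up for the illustration. Unlike the envelope bands, this is a pointwise interval centred on the sample mean curve.

```r
set.seed(42)

# Common grid of distances (assumed identical for every curve)
r <- seq(0, 200, length.out = 101)

# Hypothetical stand-in for 24 "L(r) - r" curves: one column per point pattern
curves <- replicate(24, cumsum(rnorm(length(r), sd = 0.3)))

n_curves <- ncol(curves)
avg  <- rowMeans(curves)             # pointwise mean curve
sdev <- apply(curves, 1, sd)         # pointwise standard deviation
se   <- sdev / sqrt(n_curves)        # standard error of the mean at each r

# Pointwise 95% t-interval for the mean curve
tcrit <- qt(0.975, df = n_curves - 1)
lower <- avg - tcrit * se
upper <- avg + tcrit * se

head(cbind(r, avg, lower, upper))
```

With real data, the matrix could probably be built from the observed column of each envelope object, e.g. sapply(list(Aa, Bb, Cc), function(e) e$obs), assuming every object was computed on the same r values. The band can then be drawn with lines(r, lower) and lines(r, upper) on top of a plot of avg.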


Compute and plot central credible and highest posterior density intervals for distributions in Distributions.jl

I would like to (i) compute and (ii) plot the central credible interval and the highest posterior density intervals for a distribution in the Distributions.jl library.
Ideally, one can write their own function to compute CI and HPD and then use Plots.jl to plot them. However, I'm finding the implementation quite tricky (disclaimer: I'm new to Julia).
Any suggestions about libraries/gists/repo to check out that make the computing and plotting them easier?
Context
using Plots, StatsPlots, LaTeXStrings
using Distributions
dist = Beta(10, 10)
plot(dist) # thanks to StatsPlots it nicely plots the distribution
# missing piece 1: compute CI and HPD
# missing piece 2: plot CI and HPD
Expected end result summarized in the image below or at p. 33 of BDA3.
Resources found so far:
gist: uses PyPlot, though
hdrcde R package
Thanks for updating the question; it brings a new perspective.
The gist is kind of correct; only it uses an earlier version of Julia.
Hence linspace should be replaced by LinRange, and using PyPlot should be replaced by using Plots.
I would change the plotting part to the following:
plot(cred_x, pdf(B, cred_x), fill=(0, 0.9, :orange))
plot!(x,pdf(B,x), title="pdf with 90% region highlighted")
At first glance, the computation of the CI seems correct (like the answer from Closed Limelike Curves, or the answer to the linked question). For the HPD, I concur with Closed Limelike Curves. I would only add that you could build your HPD function upon the gist code. I would also have one version for a posterior with a known distribution (like in your reference document, page 33, figure 2.2), as you don't need to sample, and another version with sampling, as Closed Limelike Curves indicated.
OP edited the question, so I'm giving a new answer.
For central credible intervals, the answer is pretty easy: take the quantiles that cut off equal probability in each tail:
lowerBound = quantile(Normal(0, 1), .025)
upperBound = quantile(Normal(0, 1), .975)
This will give you an interval where the probability of x lying below the lower bound is .025, and similarly for the upper bound, adding up to .05.
HPDs are harder to calculate. In addition, they tend to be less common because they have some weird properties that aren't shared by central credible intervals. The easiest way to do it is probably using a Monte Carlo algorithm. Use randomSample = rand(Normal(0, 1), 2^12) to draw 2^12 samples from the Normal distribution. (Or however many samples you want; more samples give more accurate results that are less affected by random chance.) Then evaluate the probability density at each random point using pdf.(Normal(0, 1), randomSample). Then pick the 95% of points with the highest probability density; include all of these points in the highest density interval, along with any points between them (I'm assuming you're dealing with a unimodal distribution like the normal).
There are better ways to do this for the normal distribution, but they're harder to generalize.
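The two recipes above can be sketched as follows. This is written in R rather than Julia, to match the other examples on this page, and uses the Beta(10, 10) distribution from the question; the variable names are made up for the illustration, and the HPD part assumes a unimodal distribution.

```r
set.seed(1)

# Central 95% credible interval: equal probability cut off in each tail
central <- qbeta(c(0.025, 0.975), shape1 = 10, shape2 = 10)

# Monte Carlo HPD interval
x <- rbeta(2^12, shape1 = 10, shape2 = 10)   # draws from the distribution
d <- dbeta(x, shape1 = 10, shape2 = 10)      # density at each draw

# Keep the 95% of draws with the highest density;
# for a unimodal density their range is the HPD interval
keep <- d >= quantile(d, 0.05)
hpd  <- range(x[keep])

central
hpd
```

Because Beta(10, 10) is symmetric, the two intervals should nearly coincide here; for a skewed distribution the HPD interval would be shifted towards the mode and be shorter than the central one.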
You're looking for ArviZ.jl, together with Turing.jl's MCMCChains. MCMCChains will give you very basic plotting capabilities, e.g. a plot of the PDF estimated from each chain. ArviZ.jl (a wrapper around the Python ArviZ package) adds a lot more plots.

Impulse response functions for Threshold VAR in R

I have two variables (a financial stress index "CISS" and output growth).
Using the tsDyn package in R, I first estimated the TVAR. paper is the time series consisting of CISS and the output growth.
tvarpaper = TVAR(paper, lag=2, nthresh=1, thDelay=2, thVar= paper[,1])
I want to calculate the impulse response functions. Having used https://github.com/MatthieuStigler/tsDyn_GIRF, this is not exactly what I want to plot. I want to plot the IRFs for the low stress and the high stress regime separately with the corresponding confidence bands.
I first thought of splitting up the sample and then calculating the IRF with the normal irf function. In the following case I tried it for the high-stress regime.
SplitUPCISS <- paper[paper[,1] > -42.9926,]
tsSplitUPCISS <- ts(SplitUPCISS)
growthUPCISS <- VAR(SplitUPCISS, p=2)
SplitUPCISSIRF <- irf(growthUPCISS, impulse="tsyCISS12", response="tslogygdp12")
However, I am not 100% sure since there is hardly any movement if I plot it. Do I actually still need to calculate the VAR for the split up sample since I already calculated the tvar beforehand to find out about the threshold variable?

R - simulate data for probability density distribution obtained from kernel density estimate

First off, I'm not entirely sure if this is the correct place to be posting this, as perhaps it should go in a more statistics-focussed forum. However, as I'm planning to implement this with R, I figured it would be best to post it here. I apologise if I'm wrong.
So, what I'm trying to do is the following. I want to simulate data for a total of 250,000 observations, assigning a continuous (non-integer) value in line with a kernel density estimate derived from empirical data (discrete), with original values ranging from -5 to +5. Here's a plot of the distribution I want to use.
It's quite essential to me that I don't simulate the new data based on the discrete probabilities, but rather the continuous ones as it's really important that a value can be say 2.89 rather than 3 or 2. So new values would be assigned based on the probabilities depicted in the plot. The most frequent value in the simulated data would be somewhere around +2, whereas values around -4 and +5 would be rather rare.
I have done quite a bit of reading on simulating data in R and about how kernel density estimates work, but I'm really not moving forward at all. So my question basically entails two steps - how do I even simulate the data (1) and furthermore, how do I simulate the data using this particular probability distribution (2)?
Thanks in advance, I hope you guys can help me out with this.
With your underlying discrete data, create a kernel density estimate on as fine a grid as you wish (i.e., as "close to continuous" as your application needs, within the limits of machine precision and computing time). Then sample from that kernel density, using the density values to ensure that more probable values of your distribution are more likely to be sampled. For example:
Fake data, just to have something to work with in this example:
set.seed(4396)
dat = round(rnorm(1000,100,10))
Create kernel density estimate. Increase n if you want the density estimated on a finer grid of points:
dens = density(dat, n=2^14)
In this case, the density is estimated on a grid of 2^14 points, with distance mean(diff(dens$x))=0.0045 between each point.
Now, sample from the kernel density estimate: We sample the x-values of the density estimate, and set prob equal to the y-values (densities) of the density estimate, so that more probable x-values will be more likely to be sampled:
kern.samp = sample(dens$x, 250000, replace=TRUE, prob=dens$y)
Compare dens (the density estimate of our original data) (black line), with the density of kern.samp (red):
plot(dens, lwd=2)
lines(density(kern.samp), col="red",lwd=2)
With the method above, you can create a finer and finer grid for the density estimate, but you'll still be limited to density values at grid points used for the density estimate (i.e., the values of dens$x). However, if you really need to be able to get the density for any data value, you can create an approximation function. In this case, you would still create the density estimate--at whatever bandwidth and grid size necessary to capture the structure of the data--and then create a function that interpolates the density between the grid points. For example:
dens = density(dat, n=2^14)
dens.func = approxfun(dens)
x = c(72.4588, 86.94, 101.1058301)
dens.func(x)
[1] 0.001689885 0.017292405 0.040875436
You can use this to obtain the density distribution at any x value (rather than just at the grid points used by the density function), and then use the output of dens.func as the prob argument to sample.
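A minimal sketch of that last step, assuming the candidate values are drawn uniformly over the range of the density estimate (the names cand, w and kern.samp2 are made up for this illustration). Because the candidates are not tied to the grid in dens$x, the sampled values can fall anywhere in that range:

```r
set.seed(4396)

# Fake discrete data, as in the example above
dat <- round(rnorm(1000, 100, 10))

# Kernel density estimate and an interpolating function for it
dens <- density(dat, n = 2^14)
dens.func <- approxfun(dens)

# Candidate values anywhere in the range of the density estimate
cand <- runif(1e6, min(dens$x), max(dens$x))

# Weight each candidate by its interpolated density, then resample
w <- dens.func(cand)
kern.samp2 <- sample(cand, 250000, replace = TRUE, prob = w)

summary(kern.samp2)
```

This is a weighted resampling of uniform candidates, so it is still an approximation; using more candidates makes it follow the density estimate more closely.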

Finding the intersection of two curves in a scatterplot (here: pvalues vs test-statistics)

I do:
library(Hmisc)
df <- as.matrix(replicate(20, rnorm(20)))
cor.df <- rcorr(df)
plot(cor.df$r,cor.df$P)
abline(h=0.05)
and I would like to know if R can compute the point where the horizontal line meets the bell-shaped curve. Since I have a scatterplot, do I need to model the x,y-curve first and then equate the two functions? Or can R do that graphically?
I actually want to know what the threshold for (uncorrected) p-values indicating a significant test statistic would be for a given dataset. I am not a trained statistician, so excuse me if that is a basic question.
Thank you very much!
There is no function to graphically calculate an intersection. There are functions like uniroot that you can use in R to find intersections, but you need to have proper functions and a good idea of the interval where the intersection occurs.
It would be best to properly model the curve in question, but a simple way to approximate a function when you have a bunch of points on the curve is just to use linear interpolation between the observed points. You can create a function for your points with approxfun:
f1 <- approxfun(cor.df$r,cor.df$P, rule=2)
(Again, a proper model would be better, but just for the sake of example, I'll continue with this function.)
Now we can find the places where this curve crosses 0.05 with
uniroot(function(x) f1(x)-.05, c(-1,-.001))$root
# [1] -0.4437796
uniroot(function(x) f1(x)-.05, c(.001, 1))$root
# [1] 0.4440005
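For comparison, here is a sketch of the "proper model" route. It assumes the P values come from the usual t test for a Pearson correlation with n - 2 degrees of freedom (which I believe is what rcorr reports by default, but treat that as an assumption); the function name p_of_r is made up for the illustration.

```r
n <- 20  # observations per variable, as in replicate(20, rnorm(20))

# Two-sided p-value of a Pearson correlation r under the t test
p_of_r <- function(r) {
  t_stat <- r * sqrt((n - 2) / (1 - r^2))
  2 * pt(-abs(t_stat), df = n - 2)
}

# Positive r at which the p-value curve crosses 0.05
r_crit <- uniroot(function(r) p_of_r(r) - 0.05, c(0.001, 0.999))$root

# Same threshold in closed form, by inverting the t statistic
t_crit  <- qt(0.975, df = n - 2)
r_exact <- t_crit / sqrt(t_crit^2 + n - 2)

c(uniroot = r_crit, closed_form = r_exact)
```

Both values should land close to the roots found by interpolation above, and by symmetry the negative threshold is just -r_crit.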

Interpolate new values using a set of samples

I'm new to R. Given a set of samples, I want to fit a function that lets me estimate the target value for new samples. Each sample is a time in seconds, the duration of a user's stay at a place:
>b <- c(101,25711,13451,19442,26,3083,133,184,4403,9713,6918,10056,12201,10624,14984,5241,
+21619,44285,3262,2115,1822,11291,3243,12989,3607,12882,4462,11553,7596,2926,12955,
+1832,3539,6897,13571,16668,813,1824,10304,2508,1493,4407,7820,507,15866,7442,7738,
+5705,2869,10137,11276,12884,11298,...)
Firstly, I convert them to hours dividing by 3600, and I want to fit a function as pdf of the duration:
> b <- b/3600
> hist(b,xlim=c(0,13),prob=T,breaks=seq(0,24,by=0.5))
> lines(density(b), col="red")
I want to fit the red line on the figure, and interpolate new values to find the probability of a specific duration at this place, say p(duration = 1.5 hours).
Thanks for your attention!
As suggested above, you can fit a distribution with fitdistr in the MASS package.
If you use a continuous distribution you will have the probability that the time is within an interval. If you use a discrete distribution, you may compute the probability of a certain time (in hours).
For the continuous case, you can use a Gamma distribution: fitdistr(b, "Gamma") will give you the parameter estimates, and then you can use pgamma with those estimates and an interval.
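For the discrete case, you can use a Poisson distribution: fitdistr(b, "Poisson") and then the dpois function with the estimate and the value you want. (The Poisson is a distribution over whole-number counts, so the durations would first have to be rounded to whole hours.)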
To decide which one to use, I'd just plot the pdf with the histogram and take a look.
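A minimal sketch of the continuous case. The durations here are simulated stand-in data (the question's vector is truncated), and the half-hour window around 1.5 hours is an arbitrary choice for the illustration:

```r
library(MASS)  # for fitdistr

set.seed(1)

# Hypothetical stand-in for the durations in hours
b <- rgamma(500, shape = 1.5, rate = 0.5)

# Maximum-likelihood fit of a Gamma distribution
fit   <- fitdistr(b, "gamma")
shape <- fit$estimate[["shape"]]
rate  <- fit$estimate[["rate"]]

# Probability that a stay lasts between 1.25 and 1.75 hours
p_interval <- pgamma(1.75, shape = shape, rate = rate) -
  pgamma(1.25, shape = shape, rate = rate)

c(shape = shape, rate = rate, p = p_interval)
```

For a continuous fit, p(duration = 1.5) is exactly zero, which is why the probability is taken over an interval. To compare the fit with the data, the fitted density can be overlaid on the histogram with curve(dgamma(x, shape, rate), add = TRUE).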
