Observations in a bivariate ellipse - R

I am trying to find the probability that a point lies within an ellipse.
For example, if I plot bivariate data (x, y) for 300 simulated points together with a 95% ellipse, how do I calculate how many of the 300 points fall inside the
ellipse?
Here's the code I am using:
library(MASS)   # for mvrnorm
set.seed(1234)
k <- 1
# covariance matrix and implied correlation
Sigma2 <- matrix(c(.72, .57, .57, .46), 2, 2)
Sigma2
rho <- Sigma2[1, 2] / sqrt(Sigma2[1, 1] * Sigma2[2, 2])
rho
# 300 bivariate normal draws (a 2 x 300 matrix)
eta1 <- replicate(300, mvrnorm(k, mu = c(-1.59, -2.44), Sigma2))
library(car)    # for dataEllipse
dataEllipse(eta1[1, ], eta1[2, ], levels = c(0.05, 0.95))
Thanks for your help.

I don't see why people are jumping on the OP. In context, it's clearly a programming question: it's about getting the empirical frequency of data points within a given ellipse, not a theoretical probability. The OP even posted code and a graph showing what they're trying to obtain.
It may be that they don't fully understand the statistical theory behind a 95% ellipse, but they didn't ask about that. Besides, making plots and calculating frequencies like this is an excellent way of coming to grips with the theory.
Anyway, here's some code that answers the narrowly-defined question of how to count the points within an ellipse obtained via a normal distribution (which is what underlies dataEllipse). The idea is to transform your data to the unit circle via principal components, then get the points within a certain radius of the origin.
within.ellipse <- function(x, y, plot.ellipse=TRUE)
{
    # accept either two vectors or a single two-column matrix
    if(missing(y) && is.matrix(x) && ncol(x) == 2)
    {
        y <- x[,2]
        x <- x[,1]
    }
    if(plot.ellipse)
        dataEllipse(x, y, levels=0.95)
    # transform the data to the unit circle via principal components
    d <- scale(prcomp(cbind(x, y), scale.=TRUE)$x)
    # radius of the 95% normal-theory ellipse on this scale
    rad <- sqrt(2 * qf(.95, 2, nrow(d) - 1))
    # proportion of points falling inside that radius
    mean(sqrt(d[,1]^2 + d[,2]^2) < rad)
}
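For example, applied to the simulated data from the question (a minimal usage sketch; eta1 is the 2 x 300 matrix generated above, so it is transposed to give one point per row):
# proportion of the 300 simulated points inside the 95% ellipse
within.ellipse(t(eta1))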
It was also commented that a 95% data ellipse contains 95% of the data by definition. This is certainly not true, at least for normal-theory ellipses. If your distribution is particularly badly behaved, the coverage frequency may not even converge to the assumed level as the sample size increases. Consider a generalised Pareto distribution, for example:
library(evd) # for rgpd
# generalised pareto has no variance for shape > 0.5
z <- sapply(1:1000, function(...) within.ellipse(rgpd(100, shape=5), rgpd(100, shape=5), FALSE))
mean(z)
[1] 0.97451
z <- sapply(1:1000, function(...) within.ellipse(rgpd(10000, shape=5), rgpd(10000, shape=5), FALSE))
mean(z)
[1] 0.9995808
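For comparison, the same check with well-behaved data (a sketch, using independent standard normal draws) should give a coverage frequency close to the nominal 0.95 level:
z.norm <- sapply(1:1000, function(...) within.ellipse(rnorm(1000), rnorm(1000), FALSE))
# for normal data the mean coverage should settle near 0.95
mean(z.norm)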

Related

How to plot bivariate density from first principles in R?

I am trying to plot the following bivariate density from first principles in R:
I have attempted to code this using the sn package, but it does not work, since this is a geometric mixture of multivariate normals and not of the skewing-mechanism form popularised by A. Azzalini. My code is as follows:
library(matlib)  # for inv()
# Make 2-D grid coordinates:
number_cor <- 1000  # the number of coordinates
x <- y <- seq(-5, 5, length.out=number_cor)  # x and y coordinates
X <- cbind(u=grid2d(x)$x, v=grid2d(x)$y)  # combining the coordinates
# Set parameters:
d <- 2
p <- 0.75
mu <- as.vector(c(0,0))
cov <- matrix(c(2,0,0,2), nrow=2, ncol=2)
# Summation bounds:
lower <- 1
upper <- 10
# First calculate the density values
mvgsn_pdf_nosum <- function(k, i) {
    ((p*((1-p)**(k-1)))/(((2*pi)**(d/2))*(cov**0.5)*(k**(d/2))))*(exp((-1/(2*k))*(t(i-k*mu))*(inv(cov))*(i-k*mu)))
}
mvgsn_pdf <- sapply(X,
    function(i) sum(mvgsn_pdf_nosum(seq(lower, upper, 1), i=i)))
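For reference, the density that mvgsn_pdf_nosum is meant to evaluate (my reading of the code above; the infinite geometric sum is truncated at upper = 10) is
$$f(\mathbf{x}) = \sum_{k=1}^{\infty} \frac{p\,(1-p)^{k-1}}{(2\pi k)^{d/2}\,\lvert\Sigma\rvert^{1/2}} \exp\!\left(-\frac{1}{2k}\,(\mathbf{x}-k\boldsymbol{\mu})^{\top}\Sigma^{-1}(\mathbf{x}-k\boldsymbol{\mu})\right),$$
where $\lvert\Sigma\rvert$ is the determinant of the covariance matrix and the exponent is a matrix quadratic form; here $d = 2$, $p = 0.75$, $\boldsymbol{\mu} = (0,0)^{\top}$ and $\Sigma = 2I$.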
The above approach works perfectly in the univariate case. However, my problem arises when I run the mvgsn_pdf step in the above code. The density is supposed to look like the following:
I have never attempted to code a bivariate density from first principles (i.e. hardcoding the PDF). Any help would kindly be appreciated.

Maximum pseudo-likelihood estimator for soft-core point process

I am trying to fit a soft-core point process model to a point pattern using maximum pseudo-likelihood. I followed the instructions given in this paper by Baddeley and Turner,
and here is the R code I came up with:
library(deldir)     # Voronoi/Dirichlet tessellation for quadrature weights
library(tidyverse)
library(fields)     # rdist for pairwise distances
# MPLE
# irregularity parameter k
k <- 0.4
## Generate a 50 x 50 grid of dummy points. "RA" and "DE" are the x and y coordinates
dum.x <- seq(ramin, ramax, length = 50)
dum.y <- seq(demin, demax, length = 50)
dum <- expand.grid(dum.x, dum.y)
colnames(dum) <- c("RA", "DE")
## Combine with data and flag which rows are data points and which are dummy points;
## X is the point pattern to be fitted
bind.x <- bind_rows(X, dum) %>%
    mutate(Ind = c(rep(1, nrow(X)), rep(0, nrow(dum))))
## Calculate quadrature weights using the Voronoi cell areas
w <- deldir(bind.x$RA, bind.x$DE)$summary$dir.area
## Response
y <- bind.x$Ind/w
# the sum of distances between all pairs of points (the sufficient statistic)
tmp <- cbind(bind.x$RA, bind.x$DE)
t1 <- rdist(tmp)^(-2/k)
t1[t1 == Inf] <- 0
t1 <- rowSums(t1)
t <- -t1
# fit the model using quasi-Poisson regression
fit <- glm(y ~ t, family = quasipoisson, weights = w)
However, the fitted coefficient for t is negative, which is obviously not a valid value for a soft-core point process. Also, my point pattern was actually simulated from a soft-core process, so it does not make sense that the fitted parameter is negative. I tried my best to find any bugs in the code, but I can't seem to find any. The only potential issue I see is that my sufficient statistic is extremely large (on the order of 10^14), which I fear may cause numerical issues. But the statistic is large only because my observation window spans very small units and the average distance between a pair of points is around 0.006, so a sufficient statistic based on these distances will certainly be very large; my intuition tells me that this alone should not cause a numerical problem or flip the sign of the fitted parameter.
Can anybody help and check if my code is correct? Thanks very much!
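A possible sanity check for the hand-rolled fit (a sketch, assuming the pattern is also available as a spatstat ppp object, here called X.ppp, and that Softcore's kappa corresponds to the k above): spatstat's ppm implements the same Baddeley-Turner quadrature device.
library(spatstat)
# X.ppp is an assumed ppp version of the point pattern X
fit.check <- ppm(X.ppp ~ 1, Softcore(kappa = 0.4))
summary(fit.check)   # compare the fitted interaction parameter with the hand-rolled glm fit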

How to calculate the area under each end of a sine curve

Given this data set:
y<-c(-13,16,35,40,28,36,43,33,40,33,22,-5,-27,-31,-29,-25,-26,-31,-26,-24,-25,-29,-23,4)
t<-1:24
My goal is to calculate two areas. The first area would integrate only the first part of the curve, which lies above the zero line. The second area would integrate the second part of the curve, which lies below the zero line.
First I would like to fit a sine wave to this data. Using this excellent answer:
https://stats.stackexchange.com/questions/60994/fit-a-sinusoidal-term-to-data
I was able to fit a sine wave (I will be using the periodic fit with a second harmonic, which looks to fit better):
ssp <- spectrum(y)
per <- 1/ssp$freq[ssp$spec==max(ssp$spec)]
reslm <- lm(y ~ sin(2*pi/per*t)+cos(2*pi/per*t))
summary(reslm)
rg <- diff(range(y))
plot(y~t,ylim=c(min(y)-0.1*rg,max(y)+0.1*rg))
lines(fitted(reslm)~t,col=4,lty=2) # dashed blue line is sin fit
# including 2nd harmonic really improves the fit
reslm2 <- lm(y ~ sin(2*pi/per*t)+cos(2*pi/per*t)+sin(4*pi/per*t)+cos(4*pi/per*t))
summary(reslm2)
lines(fitted(reslm2)~t,col=3) # solid green line is periodic with second harmonic
abline(h=0,lty=2)
Next I would like to calculate the area under the curve that is only positive, as well as the area under the curve that is exclusively negative. I've had some luck looking at similar answers using the AUC functions in the Bolstad2 and MESS packages, but my data points do not fall neatly on the zero line, and I do not know how to break the sine function up into the areas only above the zero line and only below the zero line.
First things first: to get an exact calculation, you will need to work with the exact function of the second-harmonic Fourier fit. Secondly, the beauty of harmonic functions is that they are periodic, so if you want to find where your function reaches 0, you merely need to expand your interval to be sure of finding more than two roots.
First we get the exact function from the regression model:
fourierfnct <- function(t){
    fnct <- reslm2$coeff[1]+
            reslm2$coeff[2]*sin(2*pi/per*t)+
            reslm2$coeff[3]*cos(2*pi/per*t)+
            reslm2$coeff[4]*sin(4*pi/per*t)+
            reslm2$coeff[5]*cos(4*pi/per*t)
    return(fnct)
}
Secondly, you can write a function that finds the roots (where the function is 0). R provides the uniroot function, which you can use in a loop over subintervals to find multiple roots.
manyroots <- function(f, inter, period){
    # one candidate root per subinterval [inter[i], inter[i+1]]
    roots <- rep(NA_real_, length(inter) - 1)
    for(i in 1:(length(inter)-1)){
        roots[i] <- tryCatch({
            uniroot(f, c(inter[i], inter[i+1]))$root
        }, error = function(err) {
            -1   # no sign change in this subinterval
        })
    }
    retroots <- roots[roots != -1]
    return(retroots)
}
Then you simply calculate the roots and use them to integrate the function between those boundaries.
roots <- manyroots(fourierfnct,seq(0,25),per)
integrate(fourierfnct, roots[1],roots[2])
#300.6378 with absolute error < 3.3e-12
integrate(fourierfnct, roots[2],roots[3])
#-284.6378 with absolute error < 3.2e-12
This may not be the solution you are looking for, but you could try this:
# Create a new t vector but with more subdivisions
t2 = seq(1,24,length.out = 10000)
# Evaluate your model on this t2
y2 = predict(reslm2, newdata = data.frame(t = t2))
lines(t2[y2>=0],y2[y2>=0],col="red")
# Estimate the area where the curve is greater than 0
sum(diff(t2)[1]*y2[y2>0])
# Estimate the area where the curve is less than 0
sum(diff(t2)[1]*y2[y2<0])

Fit distribution to given frequency values in R

I have frequency values changing with time (x-axis units), as presented in the picture below. After some normalization these values may be seen as data points of a density function for some distribution.
Q: Assuming that these frequency points come from a Weibull distribution T, how can I best fit a Weibull density function to the points so as to infer the parameters of T from it?
sample <- c(7787,3056,2359,1759,1819,1189,1077,1080,985,622,648,518,
611,1037,727,489,432,371,1125,69,595,624)
plot(1:length(sample), sample, type = "l")
points(1:length(sample), sample)
Update.
To prevent being misunderstood, I would like to add a little more explanation. By saying I have frequency values changing with time (x-axis units), I mean I have data which says that I have:
7787 realizations of value 1
3056 realizations of value 2
2359 realizations of value 3 ... etc.
One way towards my goal (an incorrect one, I think) would be to create the full set of these realizations:
# Loop to simulate values
set.values <- c()
for(i in 1:length(sample)){
    set.values <- c(set.values, rep(i, times = sample[i]))
}
hist(set.values)
lines(1:length(sample), sample)
points(1:length(sample), sample)
and use fitdistr (from MASS) on set.values:
library(MASS)   # for fitdistr
f2 <- fitdistr(set.values, 'weibull')
f2
Why do I think this is an incorrect way, and why am I looking for a better solution in R?
In the distribution-fitting approach presented above it is assumed that set.values is a complete set of my realisations from the distribution T.
In my original question I know only the points from the first part of the density curve - I do not know its tail, and I want to estimate the tail (and the whole density function).
Here is a better attempt; like before, it uses optim to find the best values constrained to a box (defined by the lower and upper vectors in the optim call). Notice it scales x and y as part of the optimization, in addition to the Weibull shape parameter, so we have 3 parameters to optimize over.
Unfortunately, when using all the points it pretty much always finds something on the edges of the constraining box, which indicates to me that a Weibull is maybe not a good fit for all of the data. The problem is the first two points - they are just too large. You can see the attempted fit to all the data in the first plot.
If I drop those first two points and just fit the rest, we get a much better fit; you can see this in the second plot. I think this is a good fit; it is in any case a local minimum in the interior of the constraining box.
library(optimx)   # not strictly needed here; base optim is used below
sample <- c(60953,7787,3056,2359,1759,1819,1189,1077,1080,985,622,648,518,
            611,1037,727,489,432,371,1125,69,595,624)
t.sample <- 0:22
s.fit <- sample[3:23]
t.fit <- t.sample[3:23]
# scaled Weibull density: param = (shape, y-scale, x-scale)
wx <- function(param) {
    res <- param[2]*dweibull(t.fit*param[3], shape=param[1])
    return(res)
}
# objective: root of the sum of squared errors between data and fitted curve
minwx <- function(param){
    v <- s.fit - wx(param)
    sqrt(sum(v*v))
}
p0 <- c(1, 200, 1/20)
paramopt <- optim(p0, minwx, gr=NULL, method="L-BFGS-B",
                  lower=c(0.1,100,0.01), upper=c(1.1,5000,1))
popt <- paramopt$par
popt
rms <- paramopt$value
tit <- sprintf("Weibull - Shape:%.3f yscale:%.1f xscale:%.5f rms:%.1f",popt[1],popt[2],popt[3],rms)
plot(t.sample[2:23], sample[2:23], type = "p", col="darkred")
lines(t.fit, wx(popt), col="blue")
title(main=tit)
You can directly calculate the maximum likelihood parameters, as described here.
# Defining the error of the implicit equation for the shape parameter k
k.diff <- function(k, vec){
    x2 <- seq(length(vec))
    abs(k^-1 + weighted.mean(log(x2), w = vec) -
        weighted.mean(log(x2), w = x2^k*vec))
}
# Setting the error to "quite zero", fulfilling the equation
k <- optimize(k.diff, vec=sample, interval=c(0.1,5), tol=10^-7)$min
# Calculate lambda, given k (the ML scale is the k-th root of the weighted mean of x^k)
l <- weighted.mean(seq(length(sample))^k, w = sample)^(1/k)
# Plot
plot(density(rep(seq(length(sample)), sample)))
x <- 1:25
lines(x, dweibull(x, shape=k, scale=l))
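As a quick cross-check (a sketch), the same estimates should come out of fitdistr applied to the expanded data from the earlier attempt, since weighting by the frequencies is equivalent to repeating each value:
library(MASS)   # for fitdistr
expanded <- rep(seq_along(sample), times = sample)
fitdistr(expanded, "weibull")   # shape and scale should be close to k and l above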
Assuming the data are from a Weibull distribution, you can get an estimate of the shape and scale parameter like this:
library(MASS)   # for fitdistr
sample <- c(7787,3056,2359,1759,1819,1189,1077,1080,985,622,648,518,
            611,1037,727,489,432,371,1125,69,595,624)
f <- fitdistr(sample, 'weibull')
f
If you are not sure whether the data follow a Weibull distribution, I would recommend using ks.test. This tests whether your data come from a hypothesised distribution. Given your knowledge of the nature of the data, you could test a few selected distributions and see which one works best.
For your example this would look like this:
ks = ks.test(sample, "pweibull", shape=f$estimate[1], scale=f$estimate[2])
ks
The p-value is not significant, hence you do not reject the hypothesis that the data come from a Weibull distribution.
Update: the histograms of either the Weibull or the exponential fit look like a good match to your data. I think the exponential distribution gives you the better fit; the Pareto distribution is another option.
f <- fitdistr(sample, 'weibull')
z <- rweibull(10000, shape=f$estimate[1], scale=f$estimate[2])
hist(z)
f <- fitdistr(sample, 'exponential')
z <- rexp(10000, f$estimate[1])
hist(z)
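The same ks.test idea can be applied to the exponential fit for comparison (a sketch along the lines of the check above):
f.exp <- fitdistr(sample, 'exponential')
ks.test(sample, "pexp", rate = f.exp$estimate[1])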

Superimposing gamma distribution curve to a plot

I have to superimpose a gamma distribution curve on a plot of other power-law-like curves.
I first plot the histogram points on a log-log scale:
plot(log(pp$mids),log(pp$density))
Then I would like to superimpose my gamma distribution curve by calling an external function gamma():
gamma <- function(X)
{
    n <- length(X)
    theta <- var(hh2$V1)/mean(hh2$V1)
    kappa <- mean(hh2$V1)/theta
    y <- rgamma(n, kappa, theta)
    xx <- hist(y, plot=F)
    curve(log(xx$density), add=T, col='violet', type='l')
    return( c(kappa) )
}
But this returns an error, because curve() needs an actual function of x to plot. How can I do this?
Here is a somewhat-working variant of your code:
Generate an example structured as (I guess) your data are:
library(rmutil) ## for rpareto
set.seed(101)
hh2 <- data.frame(V1=rpareto(1000, m=1, s=1.5))
Initial histogram calculation:
pp <- hist(hh2$V1,plot=FALSE)
Function (better not to call it gamma as that masks a built-in function):
ghistfun <- function(x) {
    n <- length(x)
    scalepar <- var(x)/mean(x)
    shapepar <- mean(x)^2/var(x)
    y <- rgamma(n, shape=shapepar, scale=scalepar)
    xx <- hist(y, plot=FALSE)
    lines(log(xx$mids), log(xx$density), col="red")
    curve(dgamma(exp(x), shape=shapepar, scale=scalepar, log=TRUE),
          add=TRUE, col="blue")
    shapepar
}
It might be better to use a very large number for n rather than just the length of the data, unless you are particularly interested in seeing the random fluctuations in a data set of precisely the same size. Alternatively, you could just use curve(dgamma(x, ...)), as shown. (I initially thought you would have to allow for the change of variable from a density of x to a density of log(x), but because you computed the histogram on the original, unlogged scale and then transformed the bin midpoints, you don't have to.)
plot(log(pp$mids),log(pp$density))
ghistfun(hh2$V1)
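For example, a variant along the lines of the suggestion above (a sketch; nsim and ghistfun_big are names introduced here) that uses a large simulated sample so the histogram-based curve is less noisy:
ghistfun_big <- function(x, nsim = 1e5) {
    scalepar <- var(x)/mean(x)
    shapepar <- mean(x)^2/var(x)
    # large simulated gamma sample -> smoother log-log histogram curve
    y <- rgamma(nsim, shape=shapepar, scale=scalepar)
    xx <- hist(y, plot=FALSE)
    lines(log(xx$mids), log(xx$density), col="red")
    invisible(shapepar)
}
plot(log(pp$mids), log(pp$density))
ghistfun_big(hh2$V1)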
