How to find a confidence interval in a 3d plot in R?

I have this 3d plot that I drew in R:
rb = rep(seq(0.1, 1, 0.1), 10)
ro = sort(rb)
lods = runif(100) #create a random LOD score
library(scatterplot3d)
lodsplot<- scatterplot3d(rb, ro, lods)
I found the maximum of the LOD score using max(lods) and, from that, the corresponding rb and ro. Now I want to find the 95% CI of rb and ro. Assuming max(lods) = 0.8 and the corresponding rb and ro are 0.2 and 0.3, I thought of drawing a plane using:
lodsplot$plane3d(c(0.2, 0.3, 0.8))
and then find points above the plane (which I don't know how to do). Am I thinking correctly? Thank you!
Note:
If I just do a 2d plot, this is how I would do it:
plot(rb, lods, type = "l")
which(lods == max(lods))
limit = max(lods) - 1.92
abline(h = limit)
#Find intersect points:
above <- lods > limit
intersect.points <- which(diff(above) != 0)

You need to find the points that lie above the plane defining your hypothesized 95% upper bound, which (taking the plane3d arguments as intercept and the two slopes) has the equation:
lods = 0.2 + 0.3*rb + 0.8*ro
So calculate the item numbers for the points satisfying the implicit inequality:
high <- which(lods > 0.2 + 0.3*rb + 0.8*ro)
And plot:
png()
lodsplot<- scatterplot3d(rb, ro, lods)
high <- which(lods > 0.2 + 0.3*rb + 0.8*ro)
lodsplot$plane3d(c(0.2, 0.3, 0.8))
lodsplot$points3d(rb[high], ro[high], lods[high], col="red"); dev.off()
Notice that the plane3d function in scatterplot3d also accepts a result from lm or glm, so you could first fit a model of lods ~ rb + ro and then calculate a 95% prediction surface using predict(..., type = "response") and color the points using this method. See: predict and multiplicative variables / interaction terms in probit regressions for a worked example of an equivalent procedure on an admittedly more complex model.
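One way to act on that suggestion (a sketch; a plain lm fit with predict(..., interval = "prediction") stands in here for the 95% surface, so it is a least-squares plane rather than a profile-likelihood bound):
fit <- lm(lods ~ rb + ro)
lodsplot <- scatterplot3d(rb, ro, lods)
lodsplot$plane3d(fit)                          # overlay the fitted plane
pred <- predict(fit, interval = "prediction")  # 95% prediction limits by default
high <- which(lods > pred[, "upr"])            # points above the upper prediction surface
lodsplot$points3d(rb[high], ro[high], lods[high], col = "red")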
You can also do a search on [r] prediction surface and find other potentially useful answers such as this BenBolker suggestion to use rgl: "A: scatterplot3d for Response Surface in R"

Related

How can I use cubic splines for extrapolation?

I am looking to use natural cubic splines to interpolate between some data points using stats::splinefun(). The documentation states:
"These interpolation splines can also be used for extrapolation, that is prediction at points outside the range of ‘x’. Extrapolation makes little sense for ‘method = "fmm"’; for natural splines it is linear using the slope of the interpolating curve at the nearest data point."
I have attempted to replicate the spline function in Excel as a review, which is working fine except that I can't replicate the extrapolation approach. Example data and code below:
library(stats)
# Example data
x <- c(1,2,3,4,5,6,7,8,9,10,12,15,20,25,30,40,50)
y <- c(7.1119,5.862,5.4432,5.1458,4.97,4.8484,4.7726,4.6673,4.5477,4.437,4.3163,4.1755,4.0421,3.9031,3.808,3.6594,3.663)
df <- data.frame(x,y)
# Create spline functions
splinetest <- splinefun(x = df$x, y = df$y, method = "natural")
# Create dataframe of coefficients
splinetest_coef <- environment(splinetest)$z
splinetest_coefdf <- data.frame(i = 0:16, x = splinetest_coef$x, a = splinetest_coef$y, b = splinetest_coef$b, c = splinetest_coef$c, d = splinetest_coef$d)
# Calculate extrapolated value at 51
splinetest(51)
# Result:
# [1] 3.667414
Question: How is this result calculated?
Expected result using linear extrapolation from x = 40 and x = 50 is 3.663 + (51 - 50) x (3.663 - 3.6594) / (50 - 40) = 3.66336
The spline coefficients are as follows at x = 50: a = 3.663 and b = 0.00441355...
Therefore splinetest(51) is calculated as 3.663 + 0.00441355
How is 0.00441355 calculated in this function?
Linear extrapolation is not done by computing the slope between a particular pair of points, but by using the estimated derivative at the boundary (the "nearest data point" in R's documentation). The derivative at any point can be calculated directly from the spline function, e.g. to get the estimated first derivative at the upper boundary:
splinetest(max(df$x), deriv = 1)
[1] 0.004413552
This agrees with your manual back-calculation of the slope used to do the extrapolation.
As pointed out in the comments, plotting the end of the curve/data set with curve(splinetest, from = 30, to = 60); points(x, y) clearly illustrates the difference between the derivative at the boundary (x = 50) and the line based on the last two data points (i.e. (y(x=50) - y(x=40))/10).
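A quick way to confirm this, reusing splinetest and df from the question (the step of 1 unit beyond x = 50 is just for illustration):
slope_end <- splinetest(max(df$x), deriv = 1)  # derivative at the last knot, x = 50
splinetest(max(df$x)) + 1 * slope_end          # 3.663 + 0.004413552 = 3.667414
splinetest(51)                                 # same value: the extrapolation is linear in this slope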

precrec, Sensitivity and normalized rank of a perfect model

I have some trouble interpreting the following graphs, which plot Sensitivity vs. normalized rank for a perfect model.
library(precrec)
p <- rbinom(100, 1, 0.5) # same vector for predictions and observations
prc <- evalmod(scores = p, labels = p, mode="basic")
autoplot(prc, c("Specificity", "Sensitivity"))
(plot of Specificity and Sensitivity vs. normalized rank)
I would expect a perfect model to give Specificity = Sensitivity = 1 for all of the retrieved ranked documents, and thus a line with slope 0 and intercept 1. I am clearly missing something and/or misinterpreting the x-axis label. Any hint?
Thanks
An explanation was provided by the creator of the package here

Method of Moments for Gamma distribution- histogram and superimposing the PDF

I have this question. 'Model the data in nfsold (nfsold is just a vector containing 150 numbers) as a set of 150 independent observations from a Gamma(lambda; k) distribution. Use the Method of Moments to obtain estimates of k and lambda. Draw a histogram of the data and superimpose the PDF of your fitted gamma distribution as a preliminary check that this distribution matches the observed data.'
This is the code I have written.
#The first moment of each Xi, i = 1,...,n, is E(Xi) = k/lambda.
#The second moment of each Xi is E(Xi^2) = k(k+1)/lambda^2.
#Since we have to find two things, k and lambda, we require two moments.
x_bar = sum(nfsold)/150 #This is the first moment (the sample mean)
x_bar
second_moment = sum(nfsold^2)/150
second_moment
#(1/n)(sum xi) = k/lamda
#(1/n)(sum x^2i) = k(k+1)/(lamda)^2
#By solving these because of the methods of moments we get lambda and k.
lamda_hat = (x_bar)/((second_moment)-(x_bar)^2)
lamda_hat
k_hat = (x_bar)^2/ ((second_moment)-(x_bar)^2)
k_hat
independent_observations = dgamma(x,k_hat, rate = lamda_hat)
hist( independent_observations, breaks = 15, prob = TRUE, main="Histogram for the Gamma Distribution of the data in nfsold", xlab="Independent Observations", ylab="P.D.F")
curve(dgamma(x,k_hat, rate =lamda_hat), add=TRUE, col="green")
My problem is that the superimposed curve does not follow my histogram, so I feel there is something wrong with my code. Please could I have some help with correcting it?
Thanks!
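One thing worth checking (a hedged sketch, assuming nfsold is the raw data vector): the histogram should be drawn from nfsold itself, with the fitted PDF overlaid on it, rather than from the dgamma() values:
hist(nfsold, breaks = 15, prob = TRUE,
     main = "Histogram of nfsold with fitted Gamma PDF",
     xlab = "nfsold", ylab = "Density")
curve(dgamma(x, shape = k_hat, rate = lamda_hat), add = TRUE, col = "green")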

Calculate probability of point on 2d density surface

If I calculate the 2d density surface of two vectors like in this example:
library(MASS)
a <- rnorm(1000)
b <- rnorm(1000, sd=2)
f1 <- kde2d(a, b, n = 100)
I get the following surface
filled.contour(f1)
The z-value is the estimated density.
My question now is: Is it possible to calculate the probability of a single point, e.g. a = 1, b = -4
[As I'm not a statistician, this may be the wrong wording; sorry for that. I would like to know, if this is possible at all, with what probability a point occurs.]
Thanks for every comment!
If you specify an area, then that area has a probability with respect to your density function. Of course a single point does not have a probability different from zero, but it does have a non-zero density. What is that, then?
The density at a point is the limit of the probability of a small area around the point, divided by the measure of that area, as the area's measure goes to zero. (It was actually rather hard to state that correctly; it took a few tries and is still not optimal.)
All this is really basic calculus. It is also fairly easy to write a routine to calculate the integral of that density over the area, although I imagine MASS has standard ways to do it that use more sophisticated integration techniques. Here is a quick routine that I threw together based on your example:
library(MASS)
n <- 100
a <- rnorm(1000)
b <- rnorm(1000, sd=2)
f1 <- kde2d(a, b, n = 100)
lims <- c(min(a),max(a),min(b),max(b))
filled.contour(f1)
prob <- function(f, xmin, xmax, ymin, ymax, n, lims){
  ixmin <- max( 1, n*(xmin-lims[1])/(lims[2]-lims[1]) )
  ixmax <- min( n, n*(xmax-lims[1])/(lims[2]-lims[1]) )
  iymin <- max( 1, n*(ymin-lims[3])/(lims[4]-lims[3]) )
  iymax <- min( n, n*(ymax-lims[3])/(lims[4]-lims[3]) )
  avg <- mean(f$z[ixmin:ixmax, iymin:iymax])
  probval <- (xmax-xmin)*(ymax-ymin)*avg
  return(probval)
}
prob(f1,0.5,1.5,-4.5,-3.5,n,lims)
# [1] 0.004788993
prob(f1,-1,1,-1,1,n,lims)
# [1] 0.2224353
prob(f1,-2,2,-2,2,n,lims)
# [1] 0.5916984
prob(f1,0,1,-1,1,n,lims)
# [1] 0.119455
prob(f1,1,2,-1,1,n,lims)
# [1] 0.05093696
prob(f1,-3,3,-3,3,n,lims)
# [1] 0.8080565
lims
# [1] -3.081773 4.767588 -5.496468 7.040882
Caveat: the routine seems right and gives reasonable answers, but it has not undergone anywhere near the scrutiny I would give a production function.
The z-value here is called a "probability density" rather than a "probability". As the comments have pointed out, if you want an estimated probability you will need to integrate the estimated density to find the volume under your estimated surface.
However, if what you want is the probability density at a particular point, then you can use:
kde2d(a, b, n=1, lims=c(1, 1, -4, -4))$z[1,1]
# [1] 0.006056323
This will calculate a 1x1 "grid" with a single density estimate for the point you want.
A plot confirming that it worked:
z0 <- kde2d(a, b, n=1, lims=c(1, 1, -4, -4))$z[1,1]
filled.contour(
  f1,
  plot.axes = {
    contour(f1, levels=z0, add=TRUE)
    abline(v=1, lty=3)
    abline(h=-4, lty=3)
    axis(1); axis(2)
  }
)
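As a rough cross-check of the integrate-the-density point above (a sketch reusing f1; a grid-spacing Riemann sum is only an approximation), the estimated density should sum to roughly 1 over the whole grid:
dx <- diff(f1$x[1:2])  # grid spacing in the a direction
dy <- diff(f1$y[1:2])  # grid spacing in the b direction
sum(f1$z) * dx * dy    # close to 1; slightly less, since some mass falls outside the grid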

Observation in a bivariate Ellipse

I am trying to find the probability that a point lies within an ellipse.
For example, if I plot the bivariate data (x, y) for 300 datasets with a 95% ellipse region, how do I calculate how many times out of 300 my points will fall inside the ellipse?
Here's the code I am using:
library(MASS)
set.seed(1234)
x<-NULL
k<-1
Sigma2 <- matrix(c(.72,.57,.57,.46),2,2)
Sigma2
rho <- Sigma2[1,2]/sqrt(Sigma2[1,1]*Sigma2[2,2])
rho
eta1<-replicate(300,mvrnorm(k, mu=c(-1.59,-2.44), Sigma2))
library(car)
dataEllipse(eta1[1,],eta1[2,], levels=c(0.05, 0.95))
Thanks for your help.
I don't see why people are jumping on the OP. In context, it's clearly a programming question: it's about getting the empirical frequency of data points within a given ellipse, not a theoretical probability. The OP even posted code and a graph showing what they're trying to obtain.
It may be that they don't fully understand the statistical theory behind a 95% ellipse, but they didn't ask about that. Besides, making plots and calculating frequencies like this is an excellent way of coming to grips with the theory.
Anyway, here's some code that answers the narrowly-defined question of how to count the points within an ellipse obtained via a normal distribution (which is what underlies dataEllipse). The idea is to transform your data to the unit circle via principal components, then get the points within a certain radius of the origin.
within.ellipse <- function(x, y, plot.ellipse=TRUE)
{
    if(missing(y) && is.matrix(x) && ncol(x) == 2)
    {
        y <- x[,2]
        x <- x[,1]
    }
    if(plot.ellipse)
        dataEllipse(x, y, levels=0.95)
    d <- scale(prcomp(cbind(x, y), scale.=TRUE)$x)
    rad <- sqrt(2 * qf(.95, 2, nrow(d) - 1))
    mean(sqrt(d[,1]^2 + d[,2]^2) < rad)
}
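Applied to the simulation from the question (the exact value will depend on the random draw):
within.ellipse(eta1[1,], eta1[2,])  # proportion of the 300 simulated points inside the 95% ellipse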
It was also commented that a 95% data ellipse contains 95% of the data by definition. This is certainly not true, at least for normal-theory ellipses. If your distribution is particularly bad, the coverage frequency may not even converge to the assumed level as the sample size increases. Consider a generalised Pareto distribution, for example:
library(evd) # for rgpd
# generalised pareto has no variance for shape > 0.5
z <- sapply(1:1000, function(...) within.ellipse(rgpd(100, shape=5), rgpd(100, shape=5), FALSE))
mean(z)
[1] 0.97451
z <- sapply(1:1000, function(...) within.ellipse(rgpd(10000, shape=5), rgpd(10000, shape=5), FALSE))
mean(z)
[1] 0.9995808
