How to reuse my kernel density estimation function in R?

I use density() to do KDE, like this:
#Rscript#
x <- c(rep(1,3),rep(2,4),rep(3,5))
density(x)
Am I supposed to get a probability density function? If so, how do I reuse it to obtain a probability, e.g. what is P(x <= 2) under my KDE?
Thanks for sharing your ideas!

Because density() gives you a continuous KDE, the probability of any exact value is zero. You can only get probabilities for ranges, such as P(x <= 1). For discrete data like yours, hist() should be the right choice.
EDIT:
Please have a look here
https://stats.stackexchange.com/questions/78711/how-to-find-estimate-probability-density-function-from-density-function-in-r
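A minimal sketch of how P(x <= 2) could be read off the KDE, assuming the default Gaussian kernel of density(). The names kde_cdf, p_exact and p_num are made up for this illustration. With a Gaussian kernel the KDE's CDF is the average of normal CDFs centred on the data points, with the bandwidth as standard deviation. The trapezoid sum over the gridded density is only a rough numerical cross-check.

```r
x <- c(rep(1, 3), rep(2, 4), rep(3, 5))
d <- density(x)  # Gaussian kernel by default; d$bw is the kernel's standard deviation

# CDF of a Gaussian KDE: average of normal CDFs centred on each data point
kde_cdf <- function(q) sapply(q, function(qi) mean(pnorm(qi, mean = x, sd = d$bw)))
p_exact <- kde_cdf(2)

# Rough numerical check: trapezoid rule over the grid points with x <= 2
idx <- d$x <= 2
gx <- d$x[idx]
gy <- d$y[idx]
p_num <- sum(diff(gx) * (head(gy, -1) + tail(gy, -1)) / 2)

print(c(exact = p_exact, numeric = p_num))
```

The two numbers should agree to roughly two decimal places; the small gap comes from the finite grid that density() evaluates on.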

Related

Generate beta-binomial distribution from existing vector

Is it possible to/how can I generate a beta-binomial distribution from an existing vector?
My ultimate goal is to generate a beta-binomial distribution from the below data and then obtain the 95% confidence interval for this distribution.
My data are body condition scores recorded by a veterinarian. The values of body condition range from 0-5 in increments of 0.5. It has been suggested to me here that my data follow a beta-binomial distribution, discrete values with a restricted range.
set1 <- as.data.frame(c(3,3,2.5,2.5,4.5,3,2,4,3,3.5,3.5,2.5,3,3,3.5,3,3,4,3.5,3.5,4,3.5,3.5,4,3.5))
colnames(set1) <- "numbers"
I see that there are multiple functions which appear to be able to do this, betabinomial() in VGAM and rbetabinom() in emdbook, but my stats and coding knowledge is not yet sufficient to be able to understand and implement the instructions provided on the function help pages, at least not in a way that has been helpful for my intended purpose yet.
We can look at the distribution of your variable, with the probability on the y-axis:
x1 = set1$numbers*2
h = hist(x1,breaks=seq(0,10))
bp = barplot(h$counts/length(x1),names.arg=(h$mids+0.5)/2,ylim=c(0,0.35))
You can try to fit it, but you have too few data points to estimate the 3 parameters needed for a beta-binomial. Hence I fix the probability so that the mean of the distribution equals the mean of your scores; looking at the distribution above, that seems ok:
library(bbmle)
library(emdbook)
library(MASS)
mtmp <- function(prob, size, theta) {
  -sum(dbetabinom(x1, prob, size, theta, log = TRUE))
}
m0 <- mle2(mtmp, start = list(theta = 100),
           data = list(size = 10, prob = mean(x1) / 10), control = list(maxit = 1000))
THETA=coef(m0)[1]
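As a rough illustration of what theta controls, here is a base-R sketch of a beta-binomial pmf. It assumes one common parameterisation, shape1 = theta*prob and shape2 = theta*(1-prob). The helper dbb is a hypothetical name written for this example, not a function from any of the packages above, and the values of theta are arbitrary.

```r
# Beta-binomial pmf under the ASSUMED parameterisation
# shape1 = theta * prob, shape2 = theta * (1 - prob)
dbb <- function(k, size, prob, theta, log = FALSE) {
  a <- theta * prob
  b <- theta * (1 - prob)
  lp <- lchoose(size, k) + lbeta(k + a, size - k + b) - lbeta(a, b)
  if (log) lp else exp(lp)
}

size <- 10
prob <- 0.656                                        # mean(x1) / 10 from the answer
pmf_small <- dbb(0:size, size, prob, theta = 5)      # small theta: strong overdispersion
pmf_large <- dbb(0:size, size, prob, theta = 1e4)    # large theta: close to binomial
binom     <- dbinom(0:size, size, prob)

print(round(rbind(pmf_small, pmf_large, binom), 3))
```

Under this parameterisation the mean stays at size*prob for every theta, and the pmf approaches the plain binomial as theta grows.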
We can also use a normal distribution:
normal_fit = fitdistr(x1,"normal")
MEAN=normal_fit$estimate[1]
SD=normal_fit$estimate[2]
Plot both of them:
lines(bp[,1],dbetabinom(1:10,size=10,prob=mean(x1)/10,theta=THETA),
col="blue",lwd=2)
lines(bp[,1],dnorm(1:10,MEAN,SD),col="orange",lwd=2)
legend("topleft",c("normal","betabinomial"),fill=c("orange","blue"))
I think you are actually fine using a normal approximation, and in this case the estimates are:
normal_fit$estimate
mean sd
6.560000 1.134196
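A sketch of how a 95% range could be derived from the normal fit, assuming that what is wanted is the central 95% range of the fitted distribution (not a confidence interval for the mean). It uses base R only; the names scores, range95_x1 and range95_scores are made up for this illustration.

```r
scores <- c(3, 3, 2.5, 2.5, 4.5, 3, 2, 4, 3, 3.5, 3.5, 2.5, 3, 3, 3.5,
            3, 3, 4, 3.5, 3.5, 4, 3.5, 3.5, 4, 3.5)
x1 <- scores * 2                       # same rescaling as in the answer

mu    <- mean(x1)
sigma <- sqrt(mean((x1 - mu)^2))       # maximum-likelihood sd (divides by n)

# Central 95% range of the fitted normal, on the doubled scale
range95_x1 <- qnorm(c(0.025, 0.975), mean = mu, sd = sigma)

# Back on the original 0-5 body condition scale
range95_scores <- range95_x1 / 2
print(range95_scores)
```

Because the normal is continuous and the scores move in steps of 0.5, the endpoints will not land exactly on valid score values.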

How to draw the normal distribution for Y|x

There are random variables X and Y with the following properties: E(X) = 10, Var(X) = 4, E(Y|x) = 30 - x/2 and Var(Y|x) = x.
The question is: simulate 10000 realizations (x, y) from this model, assuming normal distributions for X and Y|x, and plot x against y.
I only know how to use the rnorm and dnorm functions, like this:
x<-rnorm(10,mean=10,sd=2)
curve(dnorm(x),xlim=c(-5,5),ylim=c(0,0.5),col="red")
but how do I deal with dnorm(Y|x)?
I am not sure this is right:
y<-rnorm(10,mean=(30-0.5*x),sd=sqrt(x))
because it shows an error when I try
curve(dnorm(y),xlim=c(-5,5),ylim=c(0,0.5))
You've already calculated the x's and y's correctly; your issue lies in the curve function. You need to pass a distribution with parameters to the curve function, not realisations of that distribution.
In your case it is easier to plot the distributions using histograms.
x<-rnorm(1e4,mean=10,sd=2)
y<-rnorm(1e4,mean=(30-0.5*x),sd=sqrt(x))
hist(x)
hist(y)
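Since the question also asks for a plot of x against y, here is a sketch of the joint simulation with a scatter plot and a moment check. The seed and the pmax() guard are choices made for this example; the guard only avoids sqrt() of a negative x, which is extremely unlikely with mean 10 and sd 2.

```r
set.seed(42)
n <- 10000
x <- rnorm(n, mean = 10, sd = 2)
# Y | x ~ Normal(30 - x/2, variance x); pmax() guards against sqrt of a negative x
y <- rnorm(n, mean = 30 - 0.5 * x, sd = sqrt(pmax(x, 0)))

plot(x, y, pch = ".", xlab = "x", ylab = "y")
abline(a = 30, b = -0.5, col = "red")   # the conditional mean E(Y|x)

# Theory: E(Y) = 30 - 10/2 = 25 and Var(Y) = E(X) + Var(X)/4 = 11
print(c(mean_y = mean(y), var_y = var(y)))
```

The variance follows from the law of total variance: Var(Y) = E[Var(Y|x)] + Var(E[Y|x]) = 10 + 4/4.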

Generate random data from arbitrary CDF in R?

I have an arbitrary CDF that is applied to a point estimate. I have a number of these point estimates with associated CDFs, for which I need to simulate random data for a Monte Carlo simulation.
The CDF I'm generating by doing a spline fit to the arbitrary points provided in a table. For example, the quantile 0.1 is a product of 0.13 * point estimate. The quantile 0.9 is a product of 7.57 * point estimate. It is fairly crude and is based on a large study comparing these models to real world system -- ignore that for now please.
I fit the CDF using a spline fit as shown here.
If I take the derivative of this, I get the shape of the pdf (image).
I modified the function "samplepdf" found here, Sampling from an Arbitrary Density, as follows:
samplecdf <- function(n, cdf, spdf.lower = -Inf, spdf.upper = Inf) {
  my_fun <- match.fun(cdf)
  invcdf <- function(u) {
    subcdf <- function(t) my_fun(t) - u
    if (spdf.lower == -Inf)
      spdf.lower <- endsign(subcdf, -1)
    if (spdf.upper == Inf)
      spdf.upper <- endsign(subcdf)
    return(uniroot(subcdf, c(spdf.lower, spdf.upper))$root)
  }
  sapply(runif(n), invcdf)
}
This seems to work OK: when I compare the quantiles estimated from the randomly generated data, they are fairly close to the initial values. However, when I look at the histogram, something funny is happening at the tail, where it looks like my function is consistently generating more values than it should according to the pdf. The function does this across all my point estimates, and even though the individual quantiles seem close, I can tell that the overall Monte Carlo simulation is producing higher estimates for the 50th percentile than I expect. Here is a plot of my histogram of the random samples.
Any tips or advice would be very welcome. I think the best route would be to fit an exponential distribution to the CDF, but I'm struggling to do that. All "fitting" assumes that you have data that needs to be fitted -- this is more arbitrary than that.
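One possible alternative, sketched under stated assumptions: instead of root-finding on the CDF, fit a monotone spline directly to the quantile function and feed it uniform draws (inverse transform sampling). Only the multipliers 0.13 and 7.57 come from the question; the point estimate, the other table entries and the names probs, mult and qfun are hypothetical values chosen for illustration.

```r
set.seed(1)
point_estimate <- 100                      # hypothetical point estimate

# Quantile table: probability -> multiplier of the point estimate.
# Only 0.13 (p = 0.1) and 7.57 (p = 0.9) are from the question; the rest is made up.
probs <- c(0, 0.1, 0.5, 0.9, 1)
mult  <- c(0, 0.13, 1.0, 7.57, 20)

# Monotone Hermite spline through the quantile points (cannot overshoot or wiggle)
qfun <- splinefun(probs, mult * point_estimate, method = "monoH.FC")

# Inverse transform sampling: push uniform draws through the quantile function
samples <- qfun(runif(10000))

print(quantile(samples, c(0.1, 0.5, 0.9)))
```

A monotone method matters here: an unconstrained spline through CDF points can dip or overshoot, which would distort the tails of the sampled data.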

Octave distribution plots not working

I am trying to plot the CDF of a uniform distribution in Octave, but I am not getting the CDF. I am simply getting the original distribution. Also, the original distribution, which is meant to be uniform, is not uniform at all!
Here is my Octave code:
x = unifrnd(0,1,100,1);
hist(x)
cdfPlot = unifcdf(x)
hist(cdfPlot)
The histogram for the 1st one (hist(x)):
and the second one (hist(cdfPlot)) :
I also tried to use cdfplot(x) in Octave, but it said:
warning: the 'cdfplot' function belongs to the statistics package from
Octave Forge but has not yet been implemented.
Please read http://www.octave.org/missing.html to learn how you can
contribute missing functionality.
Please help!
Judging by the submitted code, what you are trying to do is obtain a sample from a uniform distribution and then show a (mostly) flat histogram corresponding to that uniform distribution, together with a line corresponding to its cumulative distribution function.
For the first part:
Of course, with 100 samples (and no averaging), you are not going to observe a flat distribution, but if you try:
x=unifrnd(0,1,100000,1);
hist(x);
Then you are more likely to get a flat-looking histogram.
For the second part:
unifcdf(x,A,B) returns the value of the uniform distribution's CDF at x, for the interval defined by the parameters A and B. That is the value of the CDF model itself, NOT the cumulative sum of the sample's histogram. To obtain the latter, you need to:
x=unifrnd(0,1,100000,1);
[counts, intervals] = hist(x);
xCDF = cumsum(counts);
bar(xCDF);
Finally, if you are looking for the model values, that is, the values returned by a formula describing the distribution, then for the uniform distribution each bin in your A, B interval (in this case 0, 1) has a probability of (1/nBins) and an expected count of (1/nBins)*NSamples. The cumulative count is then a straight line that rises by (1/nBins)*NSamples per bin, so its value at bin number binNum is binNum*((1/nBins)*NSamples). In the example above, using the default nBins for hist, which is 10, x is split into 10 intervals of approximately 10000 items each, and the last value of the cumulative sum is 100000, which is of course the total number of samples in x.
For more information please see this link.
Hope this helps.
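The same idea, sketched in R (the language used in the rest of this thread) rather than Octave: the cumulative sum of the histogram counts, divided by the sample size, is compared with the model CDF from punif(). The seed and the variable names are choices made for this example.

```r
set.seed(1)
n <- 100000
x <- runif(n)                               # sample from Uniform(0, 1)

# Histogram counts in 10 equal bins, without drawing the plot
h <- hist(x, breaks = seq(0, 1, by = 0.1), plot = FALSE)

cum_counts <- cumsum(h$counts)              # cumulative sum of the sample's histogram
emp_cdf    <- cum_counts / n                # empirical CDF at the right bin edges
model_cdf  <- punif(h$breaks[-1])           # model CDF at the same edges

print(round(cbind(edge = h$breaks[-1], emp_cdf, model_cdf), 3))
```

The last cumulative count equals n, and the empirical column should track the model column closely at this sample size.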

Density plot in R sometimes gives frequency, other times probabilities?

Plotting the density of some of my data yields frequencies on the Y axis, while plotting the density of other data yields probabilities(?) on the Y axis. Is there an equivalent of freq=FALSE for density() like there is for hist() so I can have control over this? I've tried searching around for this specific issue, but I almost always end up getting hist() documentation instead of finding the answer to this specific question. Thank you!
Adding such a parameter to density would be statistically unwise for the reasons articulated by @MrFlick. If you want to convert a density estimate to be on the same scale as the observations, you can multiply by the length of the vector used for the density calculation. The density then becomes a "per x unit" estimate of "frequency". Compare the two plots:
set.seed(123);x <- sample(1:10, size=5 )
#> x
#[1] 3 8 4 7 6
plot(density(x))
d <- density(x)
plot(d$x, 5*d$y, type="l")  # plot against x rather than against the index
The "per unit of x" estimate is now in the correct (approximate) range of 0.5, and its integral should be roughly equal to the number of observations. It is only by accident that the value of a density at some x would ever be similar to a probability. What should always hold is that the integral of the density is unity.
Perhaps you are looking for the ecdf function? Instead of returning a density, it provides a mechanism for constructing a cumulative probability function.
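A small sketch to check these claims numerically, using the five values printed above. The Riemann sums are only approximate, and the names area, scaled_area and Fn are made up for this illustration.

```r
x <- c(3, 8, 4, 7, 6)
d <- density(x)

dx <- diff(d$x)[1]                          # grid spacing (the grid is equally spaced)
area        <- sum(d$y) * dx                # integral of the density, close to 1
scaled_area <- sum(length(x) * d$y) * dx    # integral after scaling, close to n = 5

Fn <- ecdf(x)                               # empirical cumulative distribution function
print(c(area = area, scaled_area = scaled_area, P_le_6 = Fn(6)))
```

Here Fn(6) is the proportion of observations less than or equal to 6, which is a genuine probability estimate, unlike the height of the density curve.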
