Octave distribution plots not working - plot

I am trying to plot the CDF of a uniform distribution in Octave, but I am not getting the CDF; I am simply getting the original distribution. Also, the original distribution, which is meant to be uniform, does not look uniform at all!
Here is my octave code:
x = unifrnd(0,1,100,1);
hist(x)
cdfPlot = unifcdf(x)
hist(cdfPlot)
The histogram for the 1st one (hist(x)):
and the second one (hist(cdfPlot)) :
I also tried to use cdfplot(x) in Octave, but it said:
warning: the 'cdfplot' function belongs to the statistics package from
Octave Forge but has not yet been implemented.
Please read http://www.octave.org/missing.html to learn how you can
contribute missing functionality.
please help!

Judging by the submitted code, you are trying to draw a sample from a uniform distribution and then show a (mostly) flat histogram corresponding to that distribution, together with a line corresponding to its cumulative distribution.
For the first part:
Of course, with 100 samples (and no averaging), you are not going to observe a flat distribution, but if you try:
x=unifrnd(0,1,100000,1);
hist(x);
Then you are more likely to get a flat-looking histogram.
For the second part:
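unifcdf(x,A,B) returns the value of the uniform distribution's CDF at x, for the interval set by the parameters A and B. That is, the value of the CDF model itself, NOT the cumulative sum of the sample's histogram. For the default interval [0,1] the CDF is simply F(x) = x, which is why unifcdf(x) handed back your original sample. To obtain the cumulative sum of the histogram instead, you need to: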
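x=unifrnd(0,1,100000,1);
[counts, intervals] = hist(x);   % counts per bin and the bin centres
xCDF = cumsum(counts);           % running total of the counts (cumulative counts)
bar(xCDF);                       % divide xCDF by numel(x) to get proportions in [0,1]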
Finally, if you are looking for the model values, that is, the values given by the formula describing the distribution, then for a uniform distribution on the interval [A, B] (in this case [0, 1]) each of the nBins equal-width bins has a probability of (1/nBins) and an expected count of (1/nBins)*NSamples. The model CDF is a straight line that rises by (1/nBins) per bin, so the expected cumulative count at bin number binNum is binNum*((1/nBins)*NSamples). In the example above, using the default nBins for hist, which is 10, x is split into 10 intervals each holding approximately 10000 values of x, and the last value of the cumulative sum is 100000, which is of course the total number of samples in x.
For more information please see this link.
Hope this helps.

Related

Generate beta-binomial distribution from existing vector

Is it possible to/how can I generate a beta-binomial distribution from an existing vector?
My ultimate goal is to generate a beta-binomial distribution from the below data and then obtain the 95% confidence interval for this distribution.
My data are body condition scores recorded by a veterinarian. The values of body condition range from 0-5 in increments of 0.5. It has been suggested to me here that my data follow a beta-binomial distribution, discrete values with a restricted range.
set1 <- as.data.frame(c(3,3,2.5,2.5,4.5,3,2,4,3,3.5,3.5,2.5,3,3,3.5,3,3,4,3.5,3.5,4,3.5,3.5,4,3.5))
colnames(set1) <- "numbers"
I see that there are multiple functions which appear to be able to do this, betabinomial() in VGAM and rbetabinom() in emdbook, but my stats and coding knowledge is not yet sufficient to be able to understand and implement the instructions provided on the function help pages, at least not in a way that has been helpful for my intended purpose yet.
We can look at the distribution of your variable; the y-axis is the probability:
x1 = set1$numbers*2
h = hist(x1,breaks=seq(0,10))
bp = barplot(h$counts/length(x1),names.arg=(h$mids+0.5)/2,ylim=c(0,0.35))
You can try to fit it, but you have too few data points to estimate the 3 parameters needed for a beta-binomial. Hence I fix the size at 10 and the probability so that the mean matches the mean of your scores, leaving only theta to estimate. Looking at the distribution above, this seems OK:
library(bbmle)
library(emdbook)
library(MASS)
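# negative log-likelihood of the beta-binomial for the doubled scores x1
mtmp <- function(prob,size,theta) {
  -sum(dbetabinom(x1,prob,size,theta,log=TRUE))
}
# size and prob are held fixed via data=; only theta is optimised
m0 <- mle2(mtmp,start=list(theta=100),
           data=list(size=10,prob=mean(x1)/10),control=list(maxit=1000))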
THETA=coef(m0)[1]
We can also use a normal distribution:
normal_fit = fitdistr(x1,"normal")
MEAN=normal_fit$estimate[1]
SD=normal_fit$estimate[2]
Plot both of them:
lines(bp[,1],dbetabinom(1:10,size=10,prob=mean(x1)/10,theta=THETA),
col="blue",lwd=2)
lines(bp[,1],dnorm(1:10,MEAN,SD),col="orange",lwd=2)
legend("topleft",c("normal","betabinomial"),fill=c("orange","blue"))
I think you are actually OK using the normal approximation. In this case the estimates (on the doubled scale of x1) are:
normal_fit$estimate
mean sd
6.560000 1.134196
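The question also asked for a 95% interval. As a rough sketch, assuming the normal fit above is adequate, the central 95% range of the fitted distribution can be read off with qnorm and halved to get back to the original 0-5 scale. The names scores, x1_demo, mean_hat, sd_hat and range95 are introduced here for illustration only. The mean and sd are computed in base R as maximum-likelihood estimates (sd with divisor n), which matches the fitdistr output shown above.

```r
# Body condition scores from the question, doubled so they become integers 0..10
scores <- c(3,3,2.5,2.5,4.5,3,2,4,3,3.5,3.5,2.5,3,3,3.5,3,3,4,3.5,3.5,4,3.5,3.5,4,3.5)
x1_demo <- scores * 2

# Maximum-likelihood estimates of a normal distribution (sd uses divisor n)
mean_hat <- mean(x1_demo)
sd_hat <- sqrt(mean((x1_demo - mean_hat)^2))

# Central 95% range of the fitted normal, on the doubled scale
range95_doubled <- qnorm(c(0.025, 0.975), mean = mean_hat, sd = sd_hat)

# Back on the original 0-5 body condition scale
range95 <- range95_doubled / 2
print(range95)
```

Note that this is the range expected to contain about 95% of individual scores under the fitted normal, not a confidence interval for the mean score.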

Why does dnorm() not return the standard deviation I inputted when I do sd(dnorm())?

This may be a dumb question, however I don't understand why sd(dnorm(1:100, mean=50, sd=15)) doesn't return the standard deviation as [1] 15.0 instead of what it actually returns which is [1] 0.009440673. When I do this with rnorm() sd(rnorm(100, mean=50, sd=15)) it returns what I would expect which is a number close to 15: [1] 17.00682. Can someone please explain why sd(dnorm(x,mean=mean,sd=sd)) doesn't return the standard deviation that I input to dnorm?
The dnorm function returns the density of the normal distribution with the mean (50) and standard deviation (15) you gave it.
On the other hand, rnorm draws 100 random numbers from a normal distribution, which is why you get standard deviations close to 15.
It's always helpful to plot your data. If you try hist(dnorm(1:100, mean=50, sd=15)) you'll see that the variability is very small (see below). As MkWTF points out, that's because dnorm returns the value of the probability density function of the normal distribution at value x given specified mean and sd.
rnorm in contrast generates random numbers with probability given by the probability density function of the normal distribution, which is why it allows a sensible estimate of the SD - the generated values follow that distribution.
The documentation for dnorm/pnorm/qnorm/rnorm is not great in my opinion (as someone who lacks a background in mathematics), but if you spend some time reading different online resources about these functions, and make sure that you understand the underlying concepts (probability density functions, quantiles, random number generation, and (cumulative) distribution functions), it will become clear over time.
hist(dnorm(1:100, mean=50, sd=15))
Created on 2020-01-09 by the reprex package (v0.3.0)
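To make the difference concrete, the sketch below contrasts the two calls and then recovers the standard deviation from density values by treating them as weights on a grid. The grid x_grid, the step size and the other variable names are assumptions made for this illustration, not part of the answers above.

```r
set.seed(1)

# dnorm returns curve heights, so their spread is unrelated to the sd parameter
dens_heights <- dnorm(1:100, mean = 50, sd = 15)
sd(dens_heights)            # a small number, nowhere near 15

# rnorm returns draws from the distribution, so their spread estimates the sd
draws <- rnorm(100000, mean = 50, sd = 15)
sd(draws)                   # close to 15

# The sd can be recovered from density values by numerical integration
step <- 0.1
x_grid <- seq(-50, 150, by = step)               # covers the mean +/- about 6.7 sd
w <- dnorm(x_grid, mean = 50, sd = 15) * step    # probability mass per grid cell
mean_from_density <- sum(x_grid * w)
sd_from_density <- sqrt(sum((x_grid - mean_from_density)^2 * w))
sd_from_density             # approximately 15
```

In other words, the density values describe how likely each x is; the standard deviation is a property of the x values weighted by those densities, not of the densities themselves.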

Probability Density (pdf) extraction in R

I am attempting to reproduce the above function in R. The numerator has the product of the probability density function (pdf) of "y" at time "t". The omega_t is simply the weight (which, for now, let's ignore). The i stands for each forecast of y (along with the density) derived from model_i, at time t.
The denominator is the integral of the above product. My question is: How to estimate the densities. To get the density of the variable one needs some datapoints. So far I have this:
y<-c(-0.00604,-0.00180,0.00292,-0.0148)
forecastsy_model1<-c(-0.0183,0.00685) # respectively time t=1 and t=2 of the forecasts
forecastsy_model2<-c(-0.0163,0.00931) # similarly
all.y.1<-c(y,forecastsy_model1) #together in one vector
all.y.2<-c(y,forecastsy_model2) #same
However, I am not aware how to extract the density of x1 for time t=1, or t=6, in order to do the products. I have considered this to find the density estimated using this:
dy1<-density(all.y.1)
which(dy1$x==0.00685)
integer(0) #length(dy1$x) : 512
with dy1$x containing the n coordinates of the points where the density is estimated, according to the documentation. Shouldn't n be 6, or at least contain the points of y that I have supplied? What is the correct way to extract the density (pdf) of y?
There is an n argument in density which defaults to 512. density returns estimated density values on a relatively dense grid so that you can plot the density curve. The grid points are determined by the range of your data (plus some extension) and the value of n, and they form an evenly spaced grid. Your data points generally do not lie exactly on this grid, which is why which(dy1$x==0.00685) finds nothing.
You can use linear interpolation to get density value anywhere covered by this grid:
Find the probability density of a new data point using "density" function in R
Exact kernel density value for any point in R
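A minimal sketch of that interpolation, using the data from the question with the vector names made consistent. approx and approxfun are base R; the variable names forecasts_model1, all_y_1, dens_at_forecasts and dens_fun are introduced here for illustration.

```r
y <- c(-0.00604, -0.00180, 0.00292, -0.0148)
forecasts_model1 <- c(-0.0183, 0.00685)   # forecasts for t = 1 and t = 2
all_y_1 <- c(y, forecasts_model1)

# Kernel density estimate on the default grid of 512 evenly spaced points
dy1 <- density(all_y_1)
length(dy1$x)                              # 512

# Linearly interpolate the estimated density at the forecast values
dens_at_forecasts <- approx(dy1$x, dy1$y, xout = forecasts_model1)$y
dens_at_forecasts

# Equivalent: build a reusable function once, then call it at any point
dens_fun <- approxfun(dy1$x, dy1$y)
dens_fun(0.00685)
```

With the default settings, points outside the grid return NA, so widen the grid with the from and to arguments of density if you need values further out.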

R - simulate data for probability density distribution obtained from kernel density estimate

First off, I'm not entirely sure if this is the correct place to be posting this, as perhaps it should go in a more statistics-focussed forum. However, as I'm planning to implement this with R, I figured it would be best to post it here. Apologies if I'm wrong.
So, what I'm trying to do is the following. I want to simulate data for a total of 250,000 observations, assigning a continuous (non-integer) value in line with a kernel density estimate derived from empirical data (discrete), with original values ranging from -5 to +5. Here's a plot of the distribution I want to use.
It's quite essential to me that I don't simulate the new data based on the discrete probabilities, but rather the continuous ones as it's really important that a value can be say 2.89 rather than 3 or 2. So new values would be assigned based on the probabilities depicted in the plot. The most frequent value in the simulated data would be somewhere around +2, whereas values around -4 and +5 would be rather rare.
I have done quite a bit of reading on simulating data in R and about how kernel density estimates work, but I'm really not moving forward at all. So my question basically entails two steps - how do I even simulate the data (1) and furthermore, how do I simulate the data using this particular probability distribution (2)?
Thanks in advance, I hope you guys can help me out with this.
With your underlying discrete data, create a kernel density estimate on as fine a grid as you wish (i.e., as "close to continuous" as needed for your application (within the limits of machine precision and computing time, of course)). Then sample from that kernel density, using the density values to ensure that more probable values of your distribution are more likely to be sampled. For example:
Fake data, just to have something to work with in this example:
set.seed(4396)
dat = round(rnorm(1000,100,10))
Create kernel density estimate. Increase n if you want the density estimated on a finer grid of points:
dens = density(dat, n=2^14)
In this case, the density is estimated on a grid of 2^14 points, with distance mean(diff(dens$x))=0.0045 between each point.
Now, sample from the kernel density estimate: We sample the x-values of the density estimate, and set prob equal to the y-values (densities) of the density estimate, so that more probable x-values will be more likely to be sampled:
kern.samp = sample(dens$x, 250000, replace=TRUE, prob=dens$y)
Compare dens (the density estimate of our original data) (black line), with the density of kern.samp (red):
plot(dens, lwd=2)
lines(density(kern.samp), col="red",lwd=2)
With the method above, you can create a finer and finer grid for the density estimate, but you'll still be limited to density values at grid points used for the density estimate (i.e., the values of dens$x). However, if you really need to be able to get the density for any data value, you can create an approximation function. In this case, you would still create the density estimate--at whatever bandwidth and grid size necessary to capture the structure of the data--and then create a function that interpolates the density between the grid points. For example:
dens = density(dat, n=2^14)
dens.func = approxfun(dens)
x = c(72.4588, 86.94, 101.1058301)
dens.func(x)
[1] 0.001689885 0.017292405 0.040875436
You can use this to obtain the density distribution at any x value (rather than just at the grid points used by the density function), and then use the output of dens.func as the prob argument to sample.
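An alternative that avoids the grid altogether is the smoothed bootstrap, which is a different technique from the one in the answer above: resample the original observations with replacement and add Gaussian noise whose standard deviation equals the kernel bandwidth. For a Gaussian kernel (the default in density) this amounts to drawing from the kernel density estimate itself. A sketch, reusing the fake data; kern.samp2 and n.sim are names introduced here for illustration.

```r
set.seed(4396)
dat <- round(rnorm(1000, 100, 10))      # fake discrete data, as above

dens <- density(dat)                     # Gaussian kernel by default
n.sim <- 250000

# Pick observations at random, then jitter each one by the kernel bandwidth
kern.samp2 <- sample(dat, n.sim, replace = TRUE) +
  rnorm(n.sim, mean = 0, sd = dens$bw)

# The simulated values are continuous rather than restricted to grid points
head(kern.samp2)
```

The same plot/lines comparison used earlier can be applied to kern.samp2 to check that it tracks the original density estimate.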

Interpolate new values using a set of samples

I'm new to R. Given a set of samples, I want to fit a numeric function to them so that I can evaluate it for new samples. My sample is time in seconds, indicating how long a user stayed at a place:
>b <- c(101,25711,13451,19442,26,3083,133,184,4403,9713,6918,10056,12201,10624,14984,5241,
+21619,44285,3262,2115,1822,11291,3243,12989,3607,12882,4462,11553,7596,2926,12955,
+1832,3539,6897,13571,16668,813,1824,10304,2508,1493,4407,7820,507,15866,7442,7738,
+5705,2869,10137,11276,12884,11298,...)
Firstly, I convert them to hours by dividing by 3600, and I want to fit a function as the pdf of the duration:
> b <- b/3600
> hist(b,xlim=c(0,13),prob=T,breaks=seq(0,24,by=0.5))
> lines(density(b), col="red")
I want to fit the red line in the figure, and interpolate new values to find the probability of a specific duration at this place, say p(duration = 1.5 hours).
Thanks for your attention!
As suggested above, you can fit a distribution with fitdistr in the MASS package.
If you use a continuous distribution, you get the probability that the time falls within an interval. If you use a discrete distribution, you can compute the probability of a specific time (in hours).
For the continuous case, you can use a Gamma distribution: fitdistr(b, "Gamma") will give you the parameter estimates, and then you can use pgamma with those estimates and the two ends of the interval.
For the discrete case, you can use a Poisson distribution: fitdistr(b, "Poisson") and then the dpois function with the estimate and the value you want. Since the Poisson is defined on non-negative integers, this requires the durations to be rounded to whole hours first.
To decide which one to use, I'd just plot the pdf with the histogram and take a look.
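As an illustration of the continuous case, the sketch below fits a Gamma distribution to the durations visible in the question (the list there is truncated, so this is only a subset) and computes the probability of a stay lasting between 1 and 2 hours. To keep the example dependency-free it uses method-of-moments estimates rather than the maximum-likelihood fit from fitdistr, so the numbers will differ somewhat from fitdistr(b, "Gamma"). The names b_sec, b_hours, shape_hat, rate_hat and p_1_to_2 are introduced here for illustration.

```r
# Durations in seconds: only the values shown in the question
b_sec <- c(101,25711,13451,19442,26,3083,133,184,4403,9713,6918,10056,12201,10624,14984,5241,
  21619,44285,3262,2115,1822,11291,3243,12989,3607,12882,4462,11553,7596,2926,12955,
  1832,3539,6897,13571,16668,813,1824,10304,2508,1493,4407,7820,507,15866,7442,7738,
  5705,2869,10137,11276,12884,11298)
b_hours <- b_sec / 3600

# Method-of-moments estimates for a Gamma(shape, rate) distribution
shape_hat <- mean(b_hours)^2 / var(b_hours)
rate_hat <- mean(b_hours) / var(b_hours)

# Probability that a stay lasts between 1 and 2 hours
p_1_to_2 <- pgamma(2, shape = shape_hat, rate = rate_hat) -
  pgamma(1, shape = shape_hat, rate = rate_hat)
p_1_to_2

# Density (not a probability) at exactly 1.5 hours
dgamma(1.5, shape = shape_hat, rate = rate_hat)
```

For a continuous distribution the probability of exactly 1.5 hours is zero, which is why the probability is taken over an interval; dgamma gives the height of the fitted curve at that point instead.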

Resources