Using R Program binomial distribution probability mass function

How do I plot the full probability mass function of a binomial distribution as a barplot in R? My question is below.
Suppose you are rolling a die with success defined as getting a 4, and you roll the die independently eight times. Let X be the number of successes.
Plot the corresponding full probability mass function for X for this die-rolling example (Hint: because of the discrete nature of X, it is easy to use the barplot function for this).

With the number of trials equal to 8:
d <- 0:8
pd <- dbinom(d, 8, 1/6)
## barplot() has no type argument; label the bars with names.arg instead
barplot(pd, names.arg=d, col='blue', xlab="x", ylab="p(x)",
main="PMF for Binomial (n=8, p=1/6)")
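As a sanity check on the values being plotted, the sketch below recomputes the PMF from the binomial formula P(X = k) = choose(n, k) p^k (1 - p)^(n - k) and compares it with dbinom(). The variable names are illustrative only.

```r
# Binomial PMF by hand: P(X = k) = choose(n, k) * p^k * (1 - p)^(n - k)
n <- 8
p <- 1/6
k <- 0:n
pmf_manual  <- choose(n, k) * p^k * (1 - p)^(n - k)
pmf_builtin <- dbinom(k, size = n, prob = p)

# The two versions should agree, and the probabilities should sum to 1
all.equal(pmf_manual, pmf_builtin)
sum(pmf_builtin)

# Probability of no 4s in eight rolls, which equals (5/6)^8
pmf_builtin[1]
```

If the two vectors agree and the sum is 1, the heights of the bars are the full PMF over 0 to 8 successes.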

Related

chi-square distribution R

I am trying to fit a chi-squared distribution using fitdistr() in R. The documentation is here (and not very useful to me): https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html
Question 1: chi_df below has the following output: 3.85546875 (0.07695236). What is the second number? The variance or the standard deviation?
Question 2: fitdistr estimates the parameter 'k' of the chi-squared distribution. How do I fit the data so that I also get the scaling constant 'A'? I am naively using the last four lines of the code below (normalizing to the peak of the histogram). Obviously not good.
Question 3: Is the chi-squared distribution only defined for a certain x-range? (The variance is 2k, while the mean is k. This must require some constrained x-range... A stats question, not a programming one...)
library(MASS); ## provides fitdistr()
nnn = 1000;
## Generating a chi-sq distribution
chii <- rchisq(nnn,4, ncp = 0);
## Plotting Histogram
chi_hist <- hist(chii);
## Fitting. Gives probability density which must be scaled.
chi_df <- fitdistr(chii,"chi-squared",start=list(df=3));
chi_k <- chi_df[[1]][1];
## Plotting a fitted line:
## Spanning x-length of chi-sq data
x_chi_fit <- 1:nnn*((max(chi_hist[[1]][])-min(chi_hist[[1]][]))/nnn);
## Y data using eqn for probability function
y_chi_fit <- (1/(2^(chi_k/2)*gamma(chi_k/2)) * x_chi_fit^(chi_k/2-1) * exp(-x_chi_fit/2));
## Normalizing to the peak of the histogram
y_chi_fit <- y_chi_fit*(max(chi_hist[[2]][]/max(y_chi_fit)));
## Plotting the line
lines(x_chi_fit,y_chi_fit,lwd=2,col="green");
Thanks for your help!

As commented above, ?fitdistr says
An object of class ‘"fitdistr"’, a list with four components,
...
sd: the estimated standard errors,
... so that parenthesized number is the standard error of the parameter.
The scale parameter doesn't need to be estimated; you need either to scale the density by the number of observations and the width of your histogram bins, or just use freq=FALSE when drawing your histogram. See code below.
The chi-squared distribution is defined on the non-negative reals, which makes sense since it's the distribution of a sum of squared standard Normals (this is a statistical, not a programming question).
Set up data:
nnn <- 1000
## ensure reproducibility; not a big deal in this case,
## but good practice
set.seed(101)
## Generating a chi-sq distribution
chii <- rchisq(nnn,4, ncp = 0)
Fitting.
library(MASS)
## use method="Brent" based on warning
chi_df <- fitdistr(chii,"chi-squared",start=list(df=3),
method="Brent",lower=0.1,upper=100)
chi_k <- chi_df[[1]][1]
(For what it's worth, it looks like there might be a bug in the print method for fitdistr when method="Brent" is used. You could also use method="BFGS" and wouldn't need to specify bounds ...)
Histograms
chi_hist <- hist(chii,breaks=50,col="gray")
## scale by N and width of histogram bins
curve(dchisq(x,df=chi_k)*nnn*diff(chi_hist$breaks)[1],
add=TRUE,col="green")
## or plot histogram already scaled to a density
chi_hist <- hist(chii,breaks=50,col="gray",freq=FALSE)
curve(dchisq(x,df=chi_k),add=TRUE,col="green")
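To make the answer to Question 1 concrete, the sketch below pulls the estimate and its standard error out of the fitted object directly rather than reading them from the printed output. It reuses the fitdistr() call from the answer; the normal-approximation interval at the end is an added assumption for illustration, not something the answer claims.

```r
library(MASS)  # provides fitdistr()

set.seed(101)
chii <- rchisq(1000, df = 4)

# Same call as in the answer: one-parameter fit with a bounded Brent search
fit <- fitdistr(chii, "chi-squared", start = list(df = 3),
                method = "Brent", lower = 0.1, upper = 100)

# Pull the components out directly instead of relying on print()
k_hat <- unname(fit$estimate[1])  # estimated degrees of freedom
se_k  <- unname(fit$sd[1])        # its estimated standard error

# Rough 95% interval under a normal approximation (an assumption here)
ci <- k_hat + c(-1, 1) * qnorm(0.975) * se_k
```

The standard error describes uncertainty in the estimated df; it is not the standard deviation of the data.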

Plot density of a distribution

Is it possible to plot in R the density function of a distribution?
For example suppose that I want to plot the density function of a Normal(0,5) or a Gamma(5,5).

f <- function(x) dnorm(x,0,5)
g <- function(x) dgamma(x,5,5)
par(mfrow=c(1,2)) # set up graphics window for 2 plots
plot(f,xlim=c(-15,15),main="N(0,5)")
plot(g,xlim=c(0,3),main="Gamma(5,5)")
In R the distribution functions follow a pattern. For instance, for the normal distribution, the PDF is dnorm(...), the CDF is pnorm(...), the inverse CDF (quantile function) is qnorm(...), and the random number generator is rnorm(...).
One thing to watch out for is that R's convention for the arguments does not necessarily match what you find, for instance, on Wikipedia. For instance the arguments to dgamma(...) are x, shape, and rate, not x, k, and theta.
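A small sketch of both points, using only base R: the q function inverts the p function, and the third positional argument means sd for dnorm() and rate for dgamma(). The test values in x are arbitrary.

```r
x <- c(0.5, 1, 2)

# d/p/q/r pattern: q* inverts p* for the same parameters
p <- pnorm(x, mean = 0, sd = 5)
x_back <- qnorm(p, mean = 0, sd = 5)
all.equal(x, x_back)

# dnorm's third argument is the standard deviation, not the variance
all.equal(dnorm(x, 0, 5), exp(-x^2 / (2 * 5^2)) / (5 * sqrt(2 * pi)))

# dgamma's third positional argument is the rate; scale = 1 / rate
by_rate  <- dgamma(x, shape = 5, rate = 5)
by_scale <- dgamma(x, shape = 5, scale = 1/5)
all.equal(by_rate, by_scale)
```

Naming the arguments (sd =, rate =, scale =) avoids relying on positional order when translating a formula from another source.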

Simulate from an (arbitrary) continuous probability distribution

For a normalized probability density function defined on the real line, for example
p(x) = (2/pi) * (1/(exp(x)+exp(-x)))
(this is just an example; the solution should apply for any continuous PDF we can define) is there a package in R to simulate from the distribution? I am aware of R's built-in simulators for many distributions.
I could numerically compute the inverse cumulative distribution function at a set of quantiles, store them in a table, and use the table to map from uniform variates to variates from the desired distribution. Is there already a package that does this?

Here is a way using the distr package, which is designed for this.
library(distr)
p <- function(x) (2/pi) * (1/(exp(x)+exp(-x))) # probability density function
dist <- AbscontDistribution(d=p) # signature for a dist with pdf ~ p
rdist <- r(dist) # function to create random variates from p
set.seed(1) # for reproducible example
X <- rdist(1000) # sample from X ~ p
x <- seq(-10,10, .01)
hist(X, freq=FALSE, breaks=50, xlim=c(-5,5))
lines(x,p(x),lty=2, col="red")
You can of course also do this in base R using the methodology described in any one of the links in the comments.
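One possible base R version is the table-lookup idea described in the question: tabulate the CDF on a grid, then interpolate the inverse. This is a minimal sketch, not the method from the linked comments; the grid limits and the helper name r_table are assumptions made for illustration.

```r
# Target density from the question
p <- function(x) (2/pi) * (1/(exp(x) + exp(-x)))

# Tabulate the CDF on a grid with the trapezoid rule.
# The grid limits (-20, 20) are an assumption: the tails beyond are negligible.
grid <- seq(-20, 20, length.out = 20001)
dens <- p(grid)
cdf <- c(0, cumsum((dens[-1] + dens[-length(dens)]) / 2 * diff(grid)))
cdf <- cdf / cdf[length(cdf)]  # force the table to end exactly at 1

# Invert the table by linear interpolation: uniform draws -> quantiles
r_table <- function(n) approx(cdf, grid, xout = runif(n), ties = mean)$y

set.seed(1)
X <- r_table(1000)
```

The accuracy depends on the grid spacing and on how much probability mass lies outside the grid, so both should be chosen with the target density in mind.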

If this is the function that you're dealing with, you could just integrate the density to get the CDF and then invert it (or, if you're rusty on your integration rules like me, you could use a tool like Wolfram Alpha to do it for you). Here the CDF is F(x) = (2/pi)*atan(exp(x)), so the inverse CDF is log(tan(pi*u/2)) for u in (0,1).
In the case of the function provided, you can therefore simulate with:
draw.val <- function(numdraw) log(tan(pi*runif(numdraw)/2))
A histogram confirms that we're sampling correctly:
hist(draw.val(10000), breaks=100, probability=TRUE)
x <- seq(-10, 10, .001)
lines(x, (2/pi) * (1/(exp(x)+exp(-x))), col="red")
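The closed form behind draw.val can be checked numerically. The sketch below assumes the CDF is (2/pi)*atan(exp(x)), compares it with integrate() applied to the density, and confirms that log(tan(pi*u/2)) undoes it. The function names cdf and inv_cdf are illustrative.

```r
p <- function(x) (2/pi) * (1/(exp(x) + exp(-x)))

# Closed-form CDF and its inverse
cdf     <- function(x) (2/pi) * atan(exp(x))
inv_cdf <- function(u) log(tan(pi * u / 2))

# Compare the closed form against numerical integration of the density
xs <- c(-2, -0.5, 0, 1, 3)
cdf_numeric <- sapply(xs, function(x) integrate(p, -Inf, x)$value)
max(abs(cdf_numeric - cdf(xs)))

# The inverse should undo the CDF
max(abs(inv_cdf(cdf(xs)) - xs))
```

Both maximum differences should be close to zero, up to numerical integration and floating-point error.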

Probability transformation using R

I want to turn a continuous random variable X with cdf F(x) into a continuous random variable Y with cdf F(y) and am wondering how to implement it in R.
For example, perform a probability transformation on data following normal distribution (X) to make it conform to a desirable Weibull distribution (Y).
(x=0 has CDF F(x=0)=0.5, CDF F(y)=0.5 corresponds to y=5, then x=0 corresponds to y=5 etc.)

There are many built-in distribution functions: those starting with 'p' transform to a uniform, and those starting with 'q' transform from a uniform. So the transform in your example can be done by:
y <- qweibull( pnorm( x ), 2, 6.0056 )
Then just change the functions and/or parameters for other cases.
The distr package may also be of interest for additional capabilities.

In general, you can transform an observation x on X to an observation y on Y by
getting the probability of X ≤ x, i.e. F_X(x),
then determining what observation y has the same probability,
i.e. you want the probability of Y ≤ y, F_Y(y), to be the same as F_X(x).
This gives F_Y(y) = F_X(x).
Therefore y = F_Y^{-1}(F_X(x)),
where F_Y^{-1} is better known as the quantile function, Q_Y. The overall transformation from X to Y is summarized as: Y = Q_Y(F_X(X)).
In your particular example, from the R help, the distribution function for the normal distribution is pnorm and the quantile function for the Weibull distribution is qweibull, so you want to first call pnorm, then qweibull on the result.
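The recipe Y = Q_Y(F_X(X)) can be wrapped in a small helper. The sketch below is illustrative: transform_dist is a hypothetical name, not part of any package, and the Weibull parameters are the ones used in the other answer.

```r
# Map data with CDF p_from onto the distribution with quantile function q_to.
# transform_dist is a hypothetical helper name, not part of any package.
transform_dist <- function(x, p_from, q_to, ...) q_to(p_from(x), ...)

set.seed(42)
x <- rnorm(5000)
y <- transform_dist(x, pnorm, qweibull, shape = 2, scale = 6.0056)

# The median of X (0) should land on the median of Y (about 5)
y0 <- transform_dist(0, pnorm, qweibull, shape = 2, scale = 6.0056)

# The map is monotone, so the ordering of the data is preserved
order_kept <- all(diff(y[order(x)]) >= 0)
```

Because the map is monotone, ranks in x carry over to y, which is why x = 0 (the median of X) corresponds to y = 5 (the median of Y) in the question's example.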

Plotting Probability Density / Mass Function of Dataset in R

I have a dataset and I want to analyse these data with a probability density function or a probability mass function in R. I used a density function but it didn't give me the probability.
My data are like this:
"step","Time","energy"
1, 22469, 392.96E-03
2, 22547, 394.82E-03
3, 22828, 400.72E-03
4, 21765, 383.51E-03
5, 21516, 379.85E-03
6, 21453, 379.89E-03
7, 22156, 387.47E-03
8, 21844, 384.09E-03
9, 21250, 376.14E-03
10, 21703, 380.83E-03
I want to get the PDF/PMF for the energy vector; the data we take into account are discrete in nature, so I don't have any special type for the distribution of the data.

Your data look far from discrete to me. Expecting a probability when working with continuous data is plain wrong: for a continuous variable, any single value has probability zero. density() gives you an empirical density function, which approximates the true density function. To check that it is a valid density, we calculate the area under the curve:
energy <- rnorm(100)
dens <- density(energy)
sum(dens$y)*diff(dens$x[1:2])
[1] 1.000952
Allowing for some rounding error, the area under the curve sums to one, and hence the outcome of density() fulfills the requirements of a PDF.
Use the probability=TRUE option of hist, the function density(), or both, e.g.:
hist(energy,probability=TRUE)
lines(density(energy),col="red")
[Figure: histogram of energy on the density scale with the kernel density estimate overlaid in red]
If you really need a probability for a discrete variable, you use:
x <- sample(letters[1:4],1000,replace=TRUE)
prop.table(table(x))
x
a b c d
0.244 0.262 0.275 0.219
Edit: an illustration of why the naive count(x)/sum(count(x)) is not a solution. The fact that the bin values sum to one does not mean that the area under the curve does. The area is the sum of each bin's height times its width, so the widths of the bins have to be taken into account. Take the normal distribution, for which we can calculate the PDF using dnorm(). The following code draws a sample from a normal distribution, calculates the density, and compares it with the naive solution:
x <- sort(rnorm(100,0,0.5))
h <- hist(x,plot=FALSE)
dens1 <- h$counts/sum(h$counts)
dens2 <- dnorm(x,0,0.5)
hist(x,probability=TRUE,breaks="fd",ylim=c(0,1))
lines(h$mids,dens1,col="red")
lines(x,dens2,col="darkgreen")
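[Figure: density-scaled histogram of x, with the naive bin proportions in red and the true normal density in dark green]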
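The role of the bin width can be made explicit. The sketch below (variable names are illustrative) shows that the naive proportions sum to one, and that dividing them by the bin widths reproduces the density that hist() reports, whose area is one.

```r
set.seed(1)
x <- rnorm(100, 0, 0.5)
h <- hist(x, plot = FALSE)

widths <- diff(h$breaks)            # width of each bin
naive  <- h$counts / sum(h$counts)  # bin proportions: these sum to 1
dens_fixed <- naive / widths        # heights whose area sums to 1

sum(naive)                          # proportions sum to one
sum(dens_fixed * widths)            # area under the histogram is one
all.equal(dens_fixed, h$density)    # matches what hist() calls density
```

The naive proportions only coincide with a density when every bin has width 1.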
The cumulative distribution function
In case @Iterator was right, it's rather easy to construct the cumulative distribution function from the density. The CDF is the integral of the PDF. In the case of discrete values, that's simply the cumulative sum of the probabilities. For continuous values, we can use the fact that the grid points at which the empirical density is estimated are equally spaced, and calculate:
cdf <- cumsum(dens$y * diff(dens$x[1:2]))
cdf <- cdf / max(cdf) # to correct for the rounding errors
plot(dens$x,cdf,type="l")
[Figure: estimated cumulative distribution function plotted against dens$x]
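As a cross-check that is not part of the answer above, base R's ecdf() builds the empirical CDF directly from the data, without going through a density estimate. The sketch below compares the two on the same grid, using simulated stand-in data as the answer does.

```r
set.seed(1)
energy <- rnorm(100)  # stand-in data, as in the answer
dens <- density(energy)

# CDF from the density estimate, as in the answer
cdf <- cumsum(dens$y * diff(dens$x[1:2]))
cdf <- cdf / max(cdf)

# Empirical CDF straight from the data, evaluated on the same grid
Fn <- ecdf(energy)
gap <- max(abs(cdf - Fn(dens$x)))
```

The density-based curve is a smoothed version of the step function returned by ecdf(), so the two should stay close but will not match exactly.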