Probability transformation using R - r

I want to turn a continuous random variable X with cdf F_X(x) into a continuous random variable Y with cdf F_Y(y), and am wondering how to implement this in R.
For example, apply a probability transformation to normally distributed data (X) so that it follows a desired Weibull distribution (Y).
(x=0 has CDF value F_X(0)=0.5, and F_Y(y)=0.5 corresponds to y=5, so x=0 maps to y=5, etc.)

R has many built-in distribution functions: those starting with a 'p' transform to a uniform, and those starting with a 'q' transform from a uniform. So the transform in your example can be done by:
y <- qweibull( pnorm( x ), 2, 6.0056 )
Then just change the functions and/or parameters for other cases.
The distr package may also be of interest for additional capabilities.
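As a quick sanity check of the one-liner above, the following sketch applies it to a simulated standard normal sample. The seed and sample size are arbitrary choices made here for illustration; the Weibull shape and scale are the ones used in the answer.

```r
set.seed(1)        # arbitrary seed, for reproducibility only
x <- rnorm(1e4)    # standard normal sample (sample size is an assumption)

# Normal CDF maps x to the uniform scale; the Weibull quantile function maps it back out
y <- qweibull(pnorm(x), shape = 2, scale = 6.0056)

# The median of X (x = 0) should land on the median of Y (close to 5)
qweibull(pnorm(0), shape = 2, scale = 6.0056)

# Each y has the same CDF value under the Weibull as its x has under the normal
all.equal(pweibull(y, shape = 2, scale = 6.0056), pnorm(x))
```

A histogram of y, e.g. hist(y), should then look like the target Weibull rather than a bell curve.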

In general, you can transform an observation x on X to an observation y on Y by
getting the probability of X ≤ x, i.e. F_X(x),
then determining which observation y has the same probability.
I.e. you want the probability P(Y ≤ y) = F_Y(y) to be the same as F_X(x).
This gives F_Y(y) = F_X(x).
Therefore y = F_Y^{-1}(F_X(x)),
where F_Y^{-1} is better known as the quantile function, Q_Y. The overall transformation from X to Y is summarized as: Y = Q_Y(F_X(X)).
In your particular example, from the R help, the distribution function for the normal distribution is pnorm and the quantile function for the Weibull distribution is qweibull, so you want to call pnorm first, then qweibull on the result.
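The recipe "CDF of the source, then quantile function of the target" can be wrapped in a small helper. This is only a sketch: transform_dist is a hypothetical name, not a base R function, and it assumes the source CDF needs no extra parameters (extra arguments are passed to the target quantile function only).

```r
# Hypothetical helper: maps observations of X onto the scale of Y
transform_dist <- function(x, p_from, q_to, ...) {
  u <- p_from(x)   # F_X(x): probability scale, values in [0, 1]
  q_to(u, ...)     # Q_Y(u): matching quantile of the target distribution
}

# The normal-to-Weibull example from the question
transform_dist(0, pnorm, qweibull, shape = 2, scale = 6.0056)

# Same idea with a different target, here an exponential with rate 2
transform_dist(c(-1, 0, 1), pnorm, qexp, rate = 2)
```

If the source distribution has non-default parameters, one option is to pass a wrapper such as function(x) pnorm(x, mean = 10, sd = 2) as p_from.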

Related

How to draw the normal distribution for Y|x

There are random variables X and Y with the following properties: E(X)=10, Var(X)=4, E(Y|x)=30-x/2 and Var(Y|x)=x.
The task is: simulate 10000 realizations (x,y) from this model, assuming normal distributions for X and Y|x, and plot x against y.
I only know how to use the rnorm and dnorm functions, like this:
x<-rnorm(10,mean=10,sd=2)
curve(dnorm(x),xlim=c(-5,5),ylim=c(0,0.5),col="red")
but how do I deal with dnorm(Y|x)?
I am not sure this is right:
y<-rnorm(10,mean=(30-0.5*x),sd=sqrt(x))
because it shows an error when I try
curve(dnorm(y),xlim=c(-5,5),ylim=c(0,0.5))
You've already calculated the x's and y's correctly; your issue lies in the curve function. curve() needs a function (or an expression in x) to evaluate, such as a density with its parameters, not realisations drawn from that distribution.
In your case it is easier to plot the distributions using a histogram.
x<-rnorm(1e4,mean=10,sd=2)
y<-rnorm(1e4,mean=(30-0.5*x),sd=sqrt(x))
hist(x)
hist(y)
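Since the question also asks for a plot of the simulated pairs, here is a minimal sketch of that step. The seed is arbitrary, and the pmax() guard is an addition made here, not part of the original answer: it only protects against the very rare negative draw of x, which would otherwise make sqrt(x) return NaN.

```r
set.seed(42)                       # arbitrary seed, for reproducibility only
n <- 1e4
x <- rnorm(n, mean = 10, sd = 2)   # Var(X) = 4, so sd = 2

# Y | x is normal with mean 30 - x/2 and variance x, so sd = sqrt(x).
# pmax(x, 0) is a guard added in this sketch against negative x.
y <- rnorm(n, mean = 30 - x / 2, sd = sqrt(pmax(x, 0)))

# Scatter plot of the simulated pairs; swap the arguments to flip the axes
plot(x, y, pch = ".", xlab = "x", ylab = "y")

mean(y)   # should be close to E(Y) = 30 - E(X)/2 = 25
```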

Efficiently calculating integral of a multivariate function on non-rectangular region?

I want to compute the expected value of a multivariate function f(x) with respect to a Dirichlet distribution. My problem is "penta-nomial" (i.e. 5 variables), so calculating the explicit form of the expected value seems unreasonable. Is there a way to numerically integrate it efficiently?
f(x) = \sum_{i=0}^{4} x_i \log(n / x_i)
where x = <x_0, x_1, x_2, x_3, x_4> and n is a constant.
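No answer is recorded for this question here. One possible approach, offered only as a sketch, is plain Monte Carlo: draw from the Dirichlet distribution and average f over the draws. The parameter vector alpha, the constant n, and the number of draws m below are all hypothetical values chosen for illustration. The draws use the standard construction of a Dirichlet vector as independent gamma variables divided by their sum, so no extra package is needed.

```r
set.seed(123)                 # arbitrary seed, for reproducibility only
alpha <- c(2, 3, 4, 5, 6)     # hypothetical Dirichlet parameters
n     <- 100                  # hypothetical value of the constant n
m     <- 1e5                  # number of Monte Carlo draws

# Column j holds Gamma(alpha[j], 1) draws; dividing each row by its sum
# gives one Dirichlet(alpha) draw per row
g <- matrix(rgamma(m * length(alpha), shape = rep(alpha, each = m)), nrow = m)
x <- g / rowSums(g)

# f(x) = sum_i x_i * log(n / x_i), evaluated for every draw
f_vals <- rowSums(x * log(n / x))

est <- mean(f_vals)               # Monte Carlo estimate of E[f(X)]
se  <- sd(f_vals) / sqrt(m)       # its standard error
c(estimate = est, std_error = se)
```

Because the components sum to one, f(x) equals log(n) plus the entropy of x, so the estimate should fall between log(n) and log(n) + log(5), which is a cheap sanity check.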

Plot density of a distribution

Is it possible to plot in R the density function of a distribution?
For example suppose that I want to plot the density function of a Normal(0,5) or a Gamma(5,5).
f <- function(x) dnorm(x,0,5)
g <- function(x) dgamma(x,5,5)
par(mfrow=c(1,2)) # set up graphics window for 2 plots
plot(f,xlim=c(-15,15),main="N(0,5)")
plot(g,xlim=c(0,3),main="Gamma(5,5)")
In R the distribution functions follow a pattern. For instance, for the normal distribution, the PDF is dnorm(...), the CDF is pnorm(...), the inverse CDF (quantile function) is qnorm(...), and the random number generator is rnorm(...).
One thing to watch out for is that R's argument conventions do not necessarily match what you find elsewhere, for example on Wikipedia. The arguments to dgamma(...) are x, shape, and rate, not x, k, and theta (shape and scale).
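The parameterization point can be checked directly. The sketch below compares the rate and scale forms of dgamma on an arbitrary grid of x values; it also shows that the third argument of dnorm is the standard deviation, so dnorm(x, 0, 5) is a normal with sd 5, not variance 5.

```r
x <- seq(0.1, 3, by = 0.1)    # arbitrary evaluation grid

# Positional third argument of dgamma is the rate ...
d_rate  <- dgamma(x, shape = 5, rate = 5)
# ... which corresponds to scale = 1 / rate in the shape/scale convention
d_scale <- dgamma(x, shape = 5, scale = 1 / 5)
all.equal(d_rate, d_scale)

# Positional arguments of dnorm are mean and sd (not variance)
all.equal(dnorm(1, 0, 5), dnorm(1, mean = 0, sd = 5))
```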

Plotting fitted values vs observed ones in R or winbugs

I want to plot the fitted values versus the observed ones, with a straight line showing the goodness of fit. However, I do not want to use abline() because I did not calculate the fitted values using the lm command; I used a model that R does not cover. I calculated the coefficients and used them to calculate the fitted values. So, what can I do to obtain such a plot in R or in WinBUGS?
Here is what I want (figure not shown).
Still no data provided, but maybe this simple example using the curve function will inform the process:
x <- 1:10
y <- 2+ 3*(1:10) + rnorm(10)
plot(1:10, y)
curve( 2+3*x, 0, 10, add=TRUE)
Note to new R users: the expression y_i = 1 - xbeta + delta_i + e_i would fail in R, in part because x and beta are not separated by an operator. But if you understand R's matrix syntax, it can be a very compact expression even if "X" were multidimensional. All of this depends on the specifics, which we are so far lacking.
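To make the matrix-syntax remark concrete, here is a minimal sketch of a fitted-versus-observed plot built from hand-supplied coefficients. The data, the seed, and the coefficient vector beta are made up for illustration (the question provides none), and the model is assumed to be linear in the coefficients.

```r
set.seed(7)                      # arbitrary seed, for reproducibility only
x <- 1:10
y <- 2 + 3 * x + rnorm(10)       # simulated "observed" values

X    <- cbind(1, x)              # design matrix: intercept column plus x
beta <- c(2, 3)                  # hypothetical coefficients estimated elsewhere

# Matrix product gives the fitted values; drop() turns the 10 x 1 matrix into a vector
fitted <- drop(X %*% beta)

plot(fitted, y, xlab = "Fitted", ylab = "Observed")
# Reference line y = x drawn with lines(), so no lm object is involved
lines(range(fitted), range(fitted), lty = 2)
```

Points close to the dashed line indicate good agreement. Note that abline(a = 0, b = 1) would draw the same line and does not require a model fitted with lm either.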

Plotting Probability Density / Mass Function of Dataset in R

I have a dataset and I want to analyse it with a probability density function or a probability mass function in R. I used a density function, but it didn't give me the probability.
My data are like this:
"step","Time","energy"
1, 22469, 392.96E-03
2, 22547, 394.82E-03
3, 22828, 400.72E-03
4, 21765, 383.51E-03
5, 21516, 379.85E-03
6, 21453, 379.89E-03
7, 22156, 387.47E-03
8, 21844, 384.09E-03
9, 21250, 376.14E-03
10, 21703, 380.83E-03
I want to get the PDF/PMF for the energy vector; the data we take into account are discrete in nature, so I don't have any particular distribution in mind for the data.
Your data looks far from discrete to me. Expecting a probability when working with continuous data is plain wrong. density() gives you an empirical density function, which approximates the true density function. To prove it is a correct density, we calculate the area under the curve :
energy <- rnorm(100)
dens <- density(energy)
sum(dens$y)*diff(dens$x[1:2])
[1] 1.000952
Allowing for some rounding error, the area under the curve sums to one, and hence the output of density() fulfills the requirements of a PDF.
Use the probability=TRUE option of hist, or the function density(), or both, e.g.:
hist(energy,probability=TRUE)
lines(density(energy),col="red")
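gives a histogram of energy on the density scale with the density() estimate overlaid in red (figure not shown).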
If you really need a probability for a discrete variable, you use:
x <- sample(letters[1:4],1000,replace=TRUE)
prop.table(table(x))
x
    a     b     c     d 
0.244 0.262 0.275 0.219 
Edit: an illustration of why the naive count(x)/sum(count(x)) is not a solution. The fact that the bin values sum to one does not mean that the area under the curve does; for that, you have to multiply by the width of the bins. Take the normal distribution, for which we can calculate the PDF using dnorm(). The following code draws a sample from a normal distribution, calculates the density, and compares it with the naive solution:
x <- sort(rnorm(100,0,0.5))
h <- hist(x,plot=FALSE)
dens1 <- h$counts/sum(h$counts)
dens2 <- dnorm(x,0,0.5)
hist(x,probability=TRUE,breaks="fd",ylim=c(0,1))
lines(h$mids,dens1,col="red")
lines(x,dens2,col="darkgreen")
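This gives the histogram on the density scale, with the naive proportions in red and the true normal density in dark green (figure not shown).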
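The bin-width argument can also be verified numerically. This sketch, using an arbitrary seed and the same kind of sample as above, rescales the naive proportions by the bin widths and compares the result with the density that hist() itself reports.

```r
set.seed(1)                        # arbitrary seed, for reproducibility only
x <- rnorm(100, 0, 0.5)
h <- hist(x, plot = FALSE)

widths <- diff(h$breaks)               # width of each bin
naive  <- h$counts / sum(h$counts)     # proportions: sum to one, but are not densities
scaled <- naive / widths               # proportion per unit of x: a proper density

all.equal(scaled, h$density)   # matches the density computed by hist()
sum(scaled * widths)           # area under the histogram (height times width)
```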
The cumulative distribution function
In case #Iterator was right, it's rather easy to construct the cumulative distribution function from the density. The CDF is the integral of the PDF. For discrete values, that is simply the cumulative sum of the probabilities. For continuous values, we can use the fact that the grid points at which density() evaluates the estimate are equally spaced, and calculate:
cdf <- cumsum(dens$y * diff(dens$x[1:2]))
cdf <- cdf / max(cdf) # to correct for the rounding errors
plot(dens$x,cdf,type="l")
This gives a plot of the estimated CDF against dens$x (figure not shown).
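As a point of comparison that is not part of the original answer, base R also provides ecdf(), which returns the empirical CDF as a step function without any smoothing. The sketch below uses simulated data standing in for the energy vector and overlays the smoothed CDF obtained from density() as described above.

```r
set.seed(1)                  # arbitrary seed, for reproducibility only
energy <- rnorm(100)         # simulated stand-in for the energy data

Fn <- ecdf(energy)           # step function: proportion of observations <= t
Fn(0)                        # estimated P(energy <= 0)

# Smoothed CDF from the kernel density estimate, as in the answer
dens <- density(energy)
cdf  <- cumsum(dens$y * diff(dens$x[1:2]))
cdf  <- cdf / max(cdf)

plot(Fn, main = "Empirical CDF vs. smoothed CDF")
lines(dens$x, cdf, col = "red")
```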
