Generate positive real numbers from rpois()

I am trying to run a Poisson simulation using rpois(). I have a distribution of interest rates, recorded to two decimal places, and I want to find out whether they follow a Poisson rather than a normal distribution.
The rpois() function returns positive integers, but I want it to return positive numbers with two decimal places instead. I have tried the following:
set.seed(123)
trialA <- rpois(1000, 13.67) # generate 1000 numbers
mean(trialA)
13.22 # Great! Close enough to 13.67
var(trialA)
13.24 # terrific! mean and variance should be the same
head(trialA, 4)
6 7 8 14 # Oh no! I want numbers with two decimal places...
# Here is my solution...but it has a problem
# 1) Scale the initial distribution by multiplying lambda by 100
trialB <- rpois(1000, 13.67 * 100)
# 2) Then, divide the result by 100 so I get a fractional component
trialB <- trialB / 100
head(trialB, 4) # check results
13.56 13.62 13.26 13.44 # terrific !
# check summary results
mean(trialB)
13.67059 # as expected..great!
var(trialB)
0.153057 # oh no!! I want it to be close to: mean(trialB) = 13.67059
How can I use rpois() to generate positive numbers with two decimal places that follow a Poisson distribution?
I know that Poisson distributions are used for counts, and that counts are positive integers, but I also believe that Poisson distributions can be used to model rates, and these rates could be positive integers divided by a scalar.

If you scale a Poisson random variable to change its mean, the result is no longer Poisson, and the mean and variance are no longer equal: if you scale the values by a factor s, the mean changes by a factor s but the variance changes by a factor s^2.
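In your trialB example the scale factor is s = 1/100, so the variance comes out near 13.67/100 = 0.1367, which matches the 0.153 you saw far more closely than 13.67. A quick check:
set.seed(123)
y <- rpois(1000, 13.67 * 100) / 100
mean(y) ## about 13.67, as intended
var(y)  ## about 0.137 = 13.67/100, not 13.67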
You probably want to use the Gamma distribution. The mean of the Gamma is shape*scale and the variance is shape*scale^2, so you have to use scale=1 to get real, positive numbers with equal mean and variance:
set.seed(1001)
r <- rgamma(1e5,shape=13.67,scale=1)
mean(r) ## 13.67375
var(r) ## 13.6694
You can round to two decimal places without changing the mean and variance very much:
r2 <- round(r,2)
mean(r2) ## 13.67376
var(r2) ## 13.66938
Compare with a Poisson distribution:
par(las=1,bty="l")
curve(dgamma(x, shape = 13.67, scale = 1), from = 0, to = 30,
      ylab = "Probability or prob. density")
points(0:30, dpois(0:30, 13.67), type = "h", lwd = 2)


How to sample from an upside-down bell curve

I can generate numbers with uniform distribution by using the code below:
runif(1,min=10,max=20)
How can I sample randomly generated numbers that fall more frequently closer to the minimum and maximum boundaries? (A.k.a. an "upside-down bell curve".)
Well, a bell curve is usually Gaussian, meaning it doesn't have a min and max. You could try the Beta distribution and map it onto the desired interval, along the lines of:
min <- 1
max <- 20
q <- min + (max-min)*rbeta(10000, 0.5, 0.5)
As @Gregor-reinstateMonica noted, the Beta distribution is bounded on both ends, [0, 1], so it can easily be mapped onto any bounded interval just by scaling and shifting. It has two parameters, and it is symmetric when those parameters are equal. Parameters above 1 make it a kind of bell distribution, while parameters below 1 turn it into an inverse bell, which is what you're looking for. You could play with them: put in different values instead of 0.5 and see how it goes. Parameters equal to 1 make it uniform.
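To see the effect of the parameters, you could plot the densities side by side (a quick sketch):
curve(dbeta(x, 0.5, 0.5), from = 0.01, to = 0.99, ylab = "Beta density") # inverse bell
curve(dbeta(x, 2, 2), add = TRUE, lty = 2)  # bell
curve(dbeta(x, 1, 1), add = TRUE, lty = 3)  # uniform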
Sampling from a beta distribution is a good idea. Another way is to sample a number of uniform numbers and then take the minimum or maximum of them.
According to the theory of order statistics, the cumulative distribution function for the maximum is F(x)^n where F is the cdf from which the sample is taken and n is the number of samples, and the cdf for the minimum is 1 - (1 - F(x))^n. For a uniform distribution, the cdf is a straight line from 0 to 1, i.e., F(x) = x, and therefore the cdf of the maximum is x^n and the cdf of the minimum is 1 - (1 - x)^n. As n increases, these become more and more curved, with most of the mass close to the ends.
A web search for "order statistics" will turn up some resources.
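For instance (a minimal sketch on the question's interval [10, 20]): take either the min or the max of n uniforms for each draw, choosing between the two at random, so that mass piles up near both boundaries.
N <- 10000  # number of draws
n <- 5      # uniforms per draw; larger n pushes more mass to the ends
u <- matrix(runif(N * n, min = 10, max = 20), nrow = N)
lo <- apply(u, 1, min)               # cdf: 1 - (1 - F(x))^n
hi <- apply(u, 1, max)               # cdf: F(x)^n
x <- ifelse(runif(N) < 0.5, lo, hi)  # mix the two halves
hist(x, breaks = 40)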
If you don't care about decimal places, a hacky way would be to generate a large sample of normally distributed data points using rnorm(), count the number of times each rounded value appears (n), and then subtract each n from the maximum count (max(n)) to get inverse counts.
You can then use the inverse counts to make a new vector (that you can sample from), i.e.:
library(tidyverse)
x <- rnorm(100000, 100, 15)
x_tib <- round(x) %>%
  tibble(x = .) %>%
  count(x) %>%                     # n = how often each rounded value appears
  mutate(new_n = max(n) - n)       # invert the counts
new_x <- rep(x_tib$x, x_tib$new_n) # expand into a vector you can sample from
qplot(new_x, binwidth = 1)
An "upside-down bell curve" compared to the normal distribution can be sampled using the following algorithm. I write it in pseudocode because I'm not familiar with R. Notice that this sampler samples in a truncated interval (here, the interval [x0, x1]) because it's not possible for an upside-down bell curve extended to infinity to integrate to 1 (which is one of the requirements for a probability density).
In the pseudocode, RNDU01() is a uniform(0, 1) random number.
x0pdf = 1 - exp(-(x0*x0))
x1pdf = 1 - exp(-(x1*x1))
ymax = max(x0pdf, x1pdf)
while true
    # Choose a random x-coordinate
    x = RNDU01()*(x1-x0) + x0
    # Choose a random y-coordinate
    y = RNDU01()*ymax
    # Return x if y falls within PDF
    if y < 1 - exp(-(x*x)): return x
end
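For convenience, here is one possible R translation of the pseudocode (a sketch; the function name rinvbell and the default bounds x0 = -2, x1 = 2 are illustrative choices):
rinvbell <- function(n, x0 = -2, x1 = 2) {
  ymax <- max(1 - exp(-x0^2), 1 - exp(-x1^2))
  out <- numeric(0)
  while (length(out) < n) {
    x <- runif(n, x0, x1)                # random x-coordinates
    y <- runif(n, 0, ymax)               # random y-coordinates
    out <- c(out, x[y < 1 - exp(-x^2)])  # keep x where y falls within the PDF
  }
  out[seq_len(n)]
}
hist(rinvbell(10000), breaks = 40)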

Simulate Values that follow a Distribution Curve in R

I want to simulate demand values that follow different distributions (as in the example above: starts off linear > exponential > invlog > etc.). I'm a bit confused by the notion of probability distributions, but I thought I could use rnorm, rexp, rlogis, etc. Is there any way I could do so?
I think it may be this, but in R: Generating smoothed randoms that follow a distribution
Simulating random values from commonly-used probability distributions in R is fairly trivial using rnorm(), rexp(), etc, if you know what distribution you want to use, as well as its parameters. For example, rnorm(10, mean=5, sd=2) returns 10 draws from a normal distribution with mean 5 and sd 2.
rnorm(10, mean = 5, sd = 2)
## [1] 5.373151 7.970897 6.933788 5.455081 6.346129 5.767204 3.847219 7.477896 5.860069 6.154341
## or here's a histogram of 10000 draws...
hist(rnorm(10000, 5, 2))
You might be interested in an exponential distribution - check out hist(rexp(10000, rate=1)) to get the idea.
The easiest solution will be to investigate what probability distribution(s) you're interested in and their implementation in R.
It is still possible to return random draws from some custom function, and there are a few techniques out there for doing it - but it might get messy. Here's a VERY rough implementation of drawing randomly from probabilities defined by the region of x^3 - 3x^2 + 4 between zero and 3.
## first a vector of random uniform draws from the region
unifdraws <- runif(10000, 0, 3)
## assign a probability of "keeping" each draw, scaling by the maximum of
## x^3 - 3x^2 + 4 on [0, 3], which is 4
pkeep <- (unifdraws^3 - 3*unifdraws^2 + 4)/4
## randomly keep observations based on this probability
keep <- rbinom(10000, size = 1, prob = pkeep)
draws <- unifdraws[keep==1]
## and there it is!
hist(draws)
## of course, it's less than 10000 now, because we rejected some
length(draws)
## [1] 4364

Interpreting description of data generating process

I am trying to generate monthly stock data using a one-factor model:
$$R_{a,t} = \alpha + B R_{b,t} + \epsilon_t$$
The description says:
$R_{a,t}$ is the excess asset returns vector, $\alpha$ is the mispricing coefficients vector, $B$ is the factor loadings matrix, $R_{b,t}$ is the vector of excess returns on the factor portfolios, $R_b \sim N(\mu_b, \sigma_b)$, and $\epsilon_t$ is the vector of noise, $\epsilon \sim N(0, \Sigma_\epsilon)$, which is independent of the factor portfolios.
For our simulations, we assume that the risk-free rate follows a normal distribution, with an annual average of 2% and a standard deviation of 2%. We assume that there is only one factor (K=1), whose annual excess return has an annual average of 8% and a standard deviation of 16%. The mispricing $\alpha$ is set to zero and the factor loadings, $B$, are evenly spread between 0.5 and 1.5. Finally, the variance-covariance matrix of noise, $\Sigma_\epsilon$, is assumed to be diagonal, with elements drawn from a uniform distribution with support [0.10, 0.30], so that the cross-sectional average annual idiosyncratic volatility is 20%.
Using the information provided here I try to generate the data:
alpha <- 0 # mispricing coefficient is set to 0
B <- matrix(runif(1000, min = 0.5, max = 1.5), 100, 10) # factor loadings matrix, spread between 0.5 and 1.5
R <- rnorm(100, mean = 8/12, sd = 16/sqrt(12)) # factor with an annual excess return of 8% and a standard deviation of 16%, converted to monthly
epsilon <- rnorm(100, mean = 0, sd = runif(10, min = 0.1, max = 0.30)) # error term with mean 0 and standard deviation drawn from a uniform distribution
Then I generate the data:
data <- alpha + B*R + epsilon
My question is: am I interpreting this description correctly?

Gamma equivalent to standard deviations

I have a gamma distribution fit to my data using library(fitdistrplus). I need to determine a method for defining the range of x values that can be "reasonably" expected, analogous to using standard deviations with normal distributions.
For example, x values within two standard deviations from the mean could be considered to be the reasonable range of expected values from a normal distribution. Any suggestions for how to define a similar range of expected values based on the shape and rate parameters of a gamma distribution?
...maybe something like identifying the two values of x between which 95% of the data falls?
Let's assume we have a random variable that is gamma distributed with shape alpha=2 and rate beta=3. We would expect this distribution to have mean 2/3 and standard deviation sqrt(2)/3, and indeed we see this in simulated data:
mean(rgamma(100000, 2, 3))
# [1] 0.6667945
sd(rgamma(100000, 2, 3))
# [1] 0.4710581
sqrt(2) / 3
# [1] 0.4714045
It would be pretty weird to define a confidence range as [mean - m*sd, mean + m*sd] for some multiplier m. To see why, consider choosing m=2 in the example above. This would yield the confidence range [-0.276, 1.609], but the gamma distribution can't even take on negative values, and about 4.7% of the data falls above 1.609. This is at the very least not a well-balanced confidence interval.
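You can check that tail mass directly with pgamma:
## fraction of the distribution above mean + 2*sd = 2/3 + 2*sqrt(2)/3
pgamma(2/3 + 2*sqrt(2)/3, shape = 2, rate = 3, lower.tail = FALSE)
# about 0.0466, i.e. 4.7%, versus the 2.5% a balanced interval would leave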
A more natural choice might be to take the 0.025 and 0.975 quantiles of the distribution as a confidence range. We would expect 2.5% of the data to fall below this range and 2.5% to fall above it. We can use qgamma to determine that, for our example parameters, the confidence range is [0.081, 1.857].
qgamma(c(0.025, 0.975), 2, 3)
# [1] 0.08073643 1.85721446
The expected value of a gamma distribution is E[X] = k * theta and the variance is Var[X] = k * theta^2, where k is the shape and theta is the scale.
But typically I would use 95% quantiles to indicate data spread.

R: Function that finds the range of 95% of all values?

Is there a function, or an elegant way in the R language, to get the minimum range that covers, say, 95% of all values in a vector?
Any suggestions are very welcome :)
95% of the data will fall between the 2.5th and 97.5th percentiles. You can compute those values in R as follows:
x <- runif(100)
quantile(x,probs=c(.025,.975))
To get a sense of what's going on, here's a plot:
qts <- quantile(x, probs = c(.025, .975))
hist(x)
abline(v=qts[1],col="red")
abline(v=qts[2],col="red")
Note this is the exact/empirical 95% interval; there's no normality assumption.
It's not so hard to write such a function:
find_cover_region <- function(x, alpha = 0.95) {
  n <- length(x)
  x <- sort(x)
  k <- as.integer(round((1 - alpha) * n))  # number of points to leave outside
  # widths of all candidate intervals that each contain n - k of the points
  i <- which.min(x[seq.int(n - k, n)] - x[seq_len(k + 1L)])
  c(x[i], x[n - k + i - 1L])
}
The function finds the shortest such interval. If several intervals have the same length, the first one (counting from -Inf) is picked.
find_cover_region(1:100, 0.70)
# [1] 1 70
find_cover_region(rnorm(10000), 0.9973) # three sigma, approx (-3,3)
# [1] -2.859 3.160 # results may differ
You could also look at highest density regions (e.g., the hdr function in the hdrcde package). It's a more statistical way to find the shortest intervals with a given coverage probability (some kernel density estimators are involved).
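For example (a sketch, assuming the hdrcde package is installed):
library(hdrcde)
hdr(rnorm(10000), prob = 95)$hdr  # endpoints of the 95% highest density region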
The emp.hpd function in the TeachingDemos package will find the values in a vector that enclose a given percentage of the data (e.g., 95%) and also give the shortest range between the values. If the data are roughly symmetric then this will be close to the results of using quantile, but if the data are skewed then this will give a shorter range.
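For example (a sketch, assuming TeachingDemos is installed):
library(TeachingDemos)
x <- rexp(10000)             # a skewed sample
emp.hpd(x, conf = 0.95)      # shortest interval holding ~95% of the data
quantile(x, c(.025, .975))   # equal-tailed interval for comparison; wider here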
If the values are distributed approximately like the normal distribution, you can use the standard deviation. First, calculate the mean µ and standard deviation of the distribution; 95% of the values will be in the interval (µ - 1.960 * stdev, µ + 1.960 * stdev).
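For example:
x <- rnorm(10000, 100, 15)
c(mean(x) - 1.96 * sd(x), mean(x) + 1.96 * sd(x))  # approximate 95% range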
