How can I plot a skewed normal distribution in R, given the number of cases, the mean, the standard deviation, the median and the MAD?
An example: I have 1'196 cases, where the mean cost is 6'389, the standard deviation 5'158, the median 4'930 and the MAD 1'366. We also know that a billed case always costs something, so the cost must always be positive.
The best answer to this problem I could find is https://math.stackexchange.com/a/17995/54064, which recommends the sn package. However, I could not figure out how to use it for my concrete use case.
I've had some success with the fGarch package:
require("fGarch")
hist(rsnorm(1000, mean = 0, sd = 1, xi = 15))
mmm <- replicate(300, {
  x <- rsnorm(1196, mean = 6389, sd = 5158, xi = 15)
  c(mean = mean(x), sd = sd(x))
})
> mean(mmm[1, ])
[1] 6404.312
> mean(mmm[2, ])
[1] 5169.572
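To plot the distribution itself rather than a histogram of draws, fGarch's dsnorm() evaluates the skew-normal density directly. A minimal sketch, assuming the same eyeballed skewness xi = 15 and an illustrative cost grid of 0 to 30000:
# evaluate and plot the skew-normal density; the grid range is assumed, not derived
cost <- seq(0, 30000, by = 100)
plot(cost, dsnorm(cost, mean = 6389, sd = 5158, xi = 15),
     type = "l", xlab = "cost", ylab = "density")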
Related
I have no sample; I'd like to compute the variance, mean, median, and mode of a distribution for which I only have a vector with its density and a vector with its support. Is there an easy way to compute these statistics in R from this information?
Suppose that I only have the following information:
Support
Density
sum(Density) == 1                    # TRUE
length(Support) == length(Density)   # TRUE
You have to do weighted summations.
For example, starting with @Johann's example:
set.seed(312345)
x = rnorm(1000, mean=10, sd=1)
x_support = density(x)$x
x_density = density(x)$y
plot(x_support, x_density)
mean(x)
prints
[1] 10.00558
and what, I believe, you're looking for
m = weighted.mean(x_support, x_density)
computes the mean as the density-weighted mean of the support values, producing
10.0055796130192
Base R has weighted.mean() but no weighted.sd() or weighted.sum(); packages such as Hmisc (wtd.mean, wtd.var) cover the other weighted quantities you're looking for, or you can compute them directly, as sketched below.
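A minimal sketch of the direct computation, reusing x_support and x_density from above:
# normalize the density values into weights that sum to 1
w <- x_density / sum(x_density)
m <- weighted.mean(x_support, w)    # weighted mean, as above
v <- sum(w * (x_support - m)^2)     # weighted (population) variance
sqrt(v)                             # weighted standard deviation
# approximate median: first support point where the cumulative weight reaches 0.5
x_support[which(cumsum(w) >= 0.5)[1]]
# mode estimate: the support point with the highest density
x_support[which.max(x_density)]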
If you don't need a mathematical solution, and an empirical one is all right, you can achieve a pretty good approximation by sampling.
Let's generate some data:
set.seed(6854684)
x = rnorm(50,mean=10,sd=1)
x_support = density(x)$x
x_density = density(x)$y
# see our example:
plot(x_support, x_density )
# the real mean of x
mean(x)
Now, to 'reverse' the process, we generate a large sample from that density:
x_sampled = sample(x_support, size = 1000000, replace = TRUE, prob = x_density)
# get the statistics
mean(x_sampled)
median(x_sampled)
var(x_sampled)
etc...
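The same sampled vector gives quantiles and any other summary directly, for example:
quantile(x_sampled, c(0.025, 0.5, 0.975))   # median and a 95% interval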
I have to make a normal distribution from a set of pre-established data, henceforth xvec. So I know I need to use dnorm(xvec, meanvec, sdvec). But what values do I put in for the mean and sd? Can I always use meanvec = mean(xvec) and sdvec = sd(xvec)? Is that a reasonable approach? Or is it preferable to leave the default values of mean = 0 and sd = 1?
I'm asking this because in the examples I looked at, the values for mean and sd were always chosen beforehand. For example, this one, from https://www.tutorialspoint.com/r/r_normal_distribution.htm:
Create a sequence of numbers between -10 and 10 incrementing by 0.1.
x <- seq(-10, 10, by = .1)
Choose the mean as 2.5 and standard deviation as 0.5.
y <- dnorm(x, mean = 2.5, sd = 0.5)
Why did he put mean = 2.5 and sd = 0.5, when mean(x) and sd(x) give the following?
> mean(x)
[1] 5.105265e-16
> sd(x)
[1] 5.816786
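(As a sketch of the distinction, with hypothetical data: the x above is only the grid where the curve is evaluated, so its own mean and sd say nothing about any data. If the curve is meant to describe a data vector xvec, the sample statistics are the natural choice.)
set.seed(1)
xvec <- rnorm(200, mean = 2.5, sd = 0.5)             # hypothetical data
grid <- seq(min(xvec), max(xvec), length.out = 200)  # plotting grid only
hist(xvec, freq = FALSE)
lines(grid, dnorm(grid, mean = mean(xvec), sd = sd(xvec)), col = "red")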
I am new to R and looking to estimate the likelihood of having an outcome >= 100 using a probability density function (the outcome in my example is the size of an outbreak). I believe I have the correct coding, but something doesn't feel right about the answer when I look at the plot.
This is my code (it's based on the output of a stochastic model of an outbreak). I'd very much appreciate pointers. I think the error is in the likelihood calculation...
Thank you!
total_cases.dist <- dlnorm(sample.range, mean = total_cases.mean, sd = total_cases.sd)
total_cases.df <- data.frame("total_cases" = sample.range, "Density" = total_cases.dist)
library(ggplot2)
ggplot(total_cases.df, aes(x = total_cases, y = Density)) + geom_point()
pp <- function(x) {
  print(paste0(round(x * 100, 3), "%"))
}
# likelihood of n_cases >= 100
pp(sum(total_cases.df$Density[total_cases.df$total_cases >= 100]))
You are using dlnorm, which is the log-normal density, so its parameters are the mean and sd of log(values), not of the values themselves (the mean = and sd = arguments are partially matched to dlnorm's meanlog and sdlog). For example:
# we call the standard rlnorm
X = rlnorm(1000, 0, 1)
# the values themselves have mean exp(1/2) and sd sqrt((exp(1) - 1) * exp(1))
c(mean(X), sd(X))
# recovers what we simulated: meanlog = 0, sdlog = 1
c(mean(log(X)), sd(log(X)))
Now we simulate some data from a known Poisson distribution, where mean = variance, and model it with the log-normal:
set.seed(100)
X <- rpois(500,lambda=1310)
# we need to log values first
total_cases.mean <- mean(log(X))
total_cases.sd <- sd(log(X))
and you can see it works well:
sample.range <- 1200:1400
hist(X,br=50,freq=FALSE)
lines(sample.range,
dlnorm(sample.range,mean=total_cases.mean,sd=total_cases.sd),
col="navyblue")
For your example, you can get probability of values > 1200 (see histogram):
plnorm(1200,total_cases.mean,total_cases.sd,lower.tail=FALSE)
Now for your data: if it is true that mean = 1310.198 and total_cases.sd = 31615.26, that makes the variance several hundred thousand times your mean! I am not sure, then, that the log-normal distribution is appropriate for modeling this kind of data.
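A side note on the original calculation: a density has to be integrated to give a probability, so summing raw dlnorm values only approximates one after multiplying by the grid spacing (1 here). A quick sanity check against the exact CDF value:
# numerical integral of the density over the unit-spaced grid
sum(dlnorm(sample.range, total_cases.mean, total_cases.sd)) * 1
# exact probability of the same interval from the CDF
plnorm(1400, total_cases.mean, total_cases.sd) -
  plnorm(1200, total_cases.mean, total_cases.sd)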
I have a data set and one of its columns contains random numbers ranging from 300 to 400. I'm trying to find what proportion of this column lies between 320 and 350 using R. To my understanding, I need to standardize this data and create a bell curve first. I have the mean and standard deviation, but when I do (X - mean)/SD and get a histogram of this column, it's still not a bell curve.
This the code I tried.
myData$C1 <- (myData$C1 - C1_mean) / C1_SD
If you are simply counting the number of observations in that range, there's no need to do any standardization and you may directly use
mean(myData$C1 >= 320 & myData$C1 <= 350)
As for the standardization, it definitely doesn't create any "bell curves": it only shifts the distribution (centering) and rescales the data (dividing by the standard deviation). Other than that, the shape of the density function itself remains the same.
For instance,
x <- c(rnorm(100, mean = 300, sd = 20), rnorm(100, mean = 400, sd = 20))
mean(x >= 320 & x <= 350)
# [1] 0.065
hist(x)
hist((x - mean(x)) / sd(x))
I suspect that what you are looking for is an estimate of the true, unobserved proportion. The standardization procedure then would be applicable if you had to use tabulated values of the standard normal distribution function. However, in R we may do that without anything like that. In particular,
pnorm(350, mean = mean(x), sd = sd(x)) - pnorm(320, mean = mean(x), sd = sd(x))
# [1] 0.2091931
That's the probability P(320 <= X <= 350), where X is normally distributed with mean mean(x) and standard deviation sd(x). The figure is quite different from the one above because we misspecified the underlying distribution by assuming it to be normal; it is actually a mixture of two normal distributions.
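If you did want to go through standardization, e.g. to look values up in printed z-tables, the equivalent computation is:
# same probability via z-scores against the standard normal
z_lo <- (320 - mean(x)) / sd(x)
z_hi <- (350 - mean(x)) / sd(x)
pnorm(z_hi) - pnorm(z_lo)   # identical to the pnorm call above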
I want to generate a random sample of, say, 10,000 numbers with predefined min, max, mean, and sd values. I have followed this link (setting upper and lower limits in rnorm) to get a random sample with fixed min and max values. However, in doing so, the mean value changes.
For example,
# Function to generate values between a lower limit and an upper limit.
mysamp <- function(n, m, s, lwr, upr, nnorm) {
  set.seed(1)
  samp <- rnorm(nnorm, m, s)
  samp <- samp[samp >= lwr & samp <= upr]
  if (length(samp) >= n) {
    return(sample(samp, n))
  }
  stop(simpleError("Not enough values to sample from. Try increasing nnorm."))
}
Account_Value <- mysamp(n=10000, m=1250000, s=4500000, lwr=50000, upr=5000000, nnorm=1000000)
summary(Account_Value)
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 50060 1231000 2334000 2410000 3582000 5000000
# Note - though the min and max values are good, the mean is badly off, for an obvious reason.
# sd(Account_Value) # 1397349
I am not sure whether we can generate a random normal sample that meets all the conditions. If there is any other sort of random distribution that can meet them all, please do share it too.
Look forward to your inputs.
-Thank you.
You could use a generalized form of the beta distribution, known as the Pearson type I distribution. The standard beta distribution is defined on the interval (0,1), but you can take a linear transformation of a standard beta distributed variable to obtain values between any (min, max). The answer to this question on CrossValidated explains how to parameterize a beta distribution with its mean and variance, with certain constraints.
While it's possible to formulate both a truncated normal and a generalized beta distribution with the desired min, max, mean and sd, the shape of the two distributions will be very different. This is because the truncated normal distribution has a positive probability density at the endpoints of its support interval, while in the generalized beta distribution the density will always fall smoothly to zero at the endpoints. Which shape is more preferable will depend on your intended application.
Here's an implementation in R for generating generalized beta distributed observations with a mean, variance, min and max parameterization.
rgbeta <- function(n, mean, var, min = 0, max = 1)
{
dmin <- mean - min
dmax <- max - mean
if (dmin <= 0 || dmax <= 0)
{
stop(paste("mean must be between min =", min, "and max =", max))
}
if (var >= dmin * dmax)
{
stop(paste("var must be less than (mean - min) * (max - mean) =", dmin * dmax))
}
# mean and variance of the standard beta distributed variable
mx <- (mean - min) / (max - min)
vx <- var / (max - min)^2
# find the corresponding alpha-beta parameterization
a <- ((1 - mx) / vx - 1 / mx) * mx^2
b <- a * (1 / mx - 1)
# generate standard beta observations and transform
x <- rbeta(n, a, b)
y <- (max - min) * x + min
return(y)
}
set.seed(1)
n <- 10000
y <- rgbeta(n, mean = 1, var = 4, min = -4, max = 5)
sapply(list(mean, sd, min, max), function(f) f(y))
# [1] 0.9921269 2.0154131 -3.8653859 4.9838290
Discussion:
Hi. It is a very interesting problem. It takes quite an effort to solve properly, and a solution cannot always be found.
The first thing is that when you truncate a distribution (set a min and max for it), the standard deviation is limited: it has a maximum that depends on the min and max values. If you ask for too large a value, you cannot get it.
The second restriction limits the mean. Obviously a mean below the minimum or above the maximum will not work, but even a mean too close to either limit may be impossible to satisfy.
The third restriction limits combinations of these parameters: not every (mean, sd) pair can be satisfied at once. In fact, for any distribution supported on [min, max] with mean m, the variance can be at most (m - min) * (max - m) (the Bhatia-Davis inequality), which is the same feasibility check used in rgbeta above; a sketch of this check follows below.
But some combinations do work and can be found.
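A minimal sketch of that feasibility check (strict inequality assumed, since the bound is attained only by a two-point distribution on the endpoints):
# is any distribution on [lwr, upr] with mean m and sd s possible at all?
feasible <- function(m, s, lwr, upr) {
  m > lwr && m < upr && s^2 < (m - lwr) * (upr - m)
}
feasible(m = 3, s = 1, lwr = 1, upr = 6)   # TRUE: the example below is attainable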
Solution:
The problem is: what parameters mean and sd of the truncated (cut) distribution with limits a and b will make the resulting mean equal to desired_mean and the resulting standard deviation equal to desired_sd?
It is important that the parameters mean and sd describe the normal distribution before truncation. That is why, in the end, the sample's mean and deviation come out different from them.
Below is code that solves the problem using optim(). It may not be the best solution to this problem, but it generally works:
require(truncnorm)
eval_function <- function(mean_sd) {
  mean <- mean_sd[1]
  sd <- mean_sd[2]
  # draw a truncated-normal sample and measure how far its moments are from the targets
  sample <- rtruncnorm(n = n, a = a, b = b, mean = mean, sd = sd)
  mean_diff <- abs((desired_mean - mean(sample)) / desired_mean)
  sd_diff <- abs((desired_sd - sd(sample)) / desired_sd)
  mean_diff + sd_diff
}
n = 1000
a <- 1
b <- 6
desired_mean <- 3
desired_sd <- 1
set.seed(1)
o <- optim(c(desired_mean, desired_sd), eval_function)
new_n <- 10000
your_sample <- rtruncnorm(n = new_n, a = a, b = b, mean = o$par[1], sd = o$par[2])
mean(your_sample)
sd(your_sample)
min(your_sample)
max(your_sample)
eval_function(c(o$par[1], o$par[2]))
I would be very interested in other solutions to this problem, so please post them if you find any.
EDIT:
@Mikko Marttila: thanks to your comment and the Wikipedia link, I implemented the formulas for the mean and sd of a truncated distribution. The solution is now WAY more elegant, and it should calculate the mean and sd of the desired distribution quite accurately, if they exist. It also works much faster.
I implemented eval_function2, which should be used in optim() instead of the previous one:
eval_function2 <- function(mean_sd) {
  mean <- mean_sd[1]
  sd <- mean_sd[2]
  alpha <- (a - mean) / sd
  beta <- (b - mean) / sd
  z <- pnorm(beta) - pnorm(alpha)
  # mean of the truncated normal
  trunc_mean <- mean + sd * (dnorm(alpha) - dnorm(beta)) / z
  # variance of the truncated normal; note the last term is squared
  trunc_var <- sd^2 *
    (1 +
       (alpha * dnorm(alpha) - beta * dnorm(beta)) / z -
       ((dnorm(alpha) - dnorm(beta)) / z)^2)
  trunc_sd <- sqrt(trunc_var)
  mean_diff <- abs((desired_mean - trunc_mean) / desired_mean)
  sd_diff <- abs((desired_sd - trunc_sd) / desired_sd)
  # combined objective for optim() to minimize
  mean_diff + sd_diff
}
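For completeness, a usage sketch (same a, b, desired_mean and desired_sd as above):
o2 <- optim(c(desired_mean, desired_sd), eval_function2)
your_sample2 <- rtruncnorm(n = 10000, a = a, b = b, mean = o2$par[1], sd = o2$par[2])
c(mean = mean(your_sample2), sd = sd(your_sample2))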