Lognormal truncated distribution with R, random values

I need to generate random values that represent times (in seconds) that follow a lognormal distribution with:
Min: 120 seconds
Max: 1260 seconds
Mean: 356 seconds
SD: 98 seconds
I am generating 100 random numbers:
library(EnvStats)
sample1 <- rlnormTrunc(100,356,98,120,1260)
and when I calculate the mean, it is not 356 but higher, about 490 seconds. Why?
I don't understand what I am doing wrong, as I thought I was going to get the same mean.
Does anyone have an answer for this?

The reason is that you are comparing different distributions, so when you create random numbers from these distributions, their means differ.
If we take the normal distribution as an example, then
set.seed(111)
sample1 <- rnorm(n=10000,mean=356,sd=98)
mean(sample1) #355.7724
the mean would indeed be almost 356. But if we take the truncated normal distribution instead, then
set.seed(111)
sample2 <- rnormTrunc(n=100000, mean=356, sd=98, min=120, max=1260)
mean(sample2) #357.9636
the mean would be slightly different, around 358 rather than 356. The reason the difference is so small is that, as seen in the histogram
hist(rnorm(n=10000,mean=356,sd=98),breaks=100,xlim=c(0,1300))
abline(v=120,col="red")
abline(v=1260,col="red")
[histogram of rnorm(n=10000, mean=356, sd=98); the red lines mark the truncation bounds 120 and 1260]
truncating only removes very infrequent values (smaller than 120 or bigger than 1260).
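You can also get the ~358 without simulation. For a normal distribution truncated to [a, b], the mean is mu + sigma * (dnorm(alpha) - dnorm(beta)) / (pnorm(beta) - pnorm(alpha)), where alpha and beta are the standardized bounds; a quick base-R check of this standard formula:
mu <- 356; sigma <- 98
alpha <- (120 - mu) / sigma   # standardized lower bound
beta  <- (1260 - mu) / sigma  # standardized upper bound
mu + sigma * (dnorm(alpha) - dnorm(beta)) / (pnorm(beta) - pnorm(alpha))
# ~358.2, agreeing with the simulated value above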
The lognormal, by contrast, is a fat-tailed distribution, skewed to the right. This means that it includes far more infrequent values than the normal distribution, reaching far beyond 1260. If you truncate the distribution between 120 and 1260
hist(rlnormTrunc(10000,meanlog=356,sdlog=98,min=120,max=1260),breaks=100)
you get
set.seed(111)
mean(rlnormTrunc(10000,meanlog=356,sdlog=98,min=120,max=1260)) #493.3903
[histogram of the truncated lognormal sample: right-skewed, concentrated near 120 with a long tail out to 1260]
In each of the examples above you compute the mean of random values drawn from a different distribution over a different range, which is why you end up with different mean values.
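The ~493 is not a sampling accident either: integrating x against the truncated lognormal density (EnvStats also provides dlnormTrunc) recovers the same number, the true mean of the distribution actually being sampled:
library(EnvStats)
integrand <- function(x) x * dlnormTrunc(x, meanlog=356, sdlog=98, min=120, max=1260)
integrate(integrand, lower=120, upper=1260)
# ~493, the theoretical mean of this truncated lognormal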

Related

How can I create a normally distributed set of data in R?

I'm a newbie in statistics and I'm studying R.
I decided to do this exercise to practice some analysis with an original dataset.
This is the issue: I want to create a dataset of, let's say, 100 subjects, and for each one of them I have a test score.
This test score has a range that goes from 0 to 70, and the mean score is 48 (and it's improbable that someone scores 0).
First I tried to create the set with x <- round(runif(100, min=0, max=70)), but then I found out the values were not normally distributed using plot(x).
So I searched for another R command and found this, but I couldn't set the min/max:
ex1 <- round(rnorm(100, mean=48 , sd=5))
I really can't understand what I have to do!
I would like to write a function that gives me a set of normally distributed data, in a range of 0-70, with a mean of 48 and a not too big standard deviation, so that I can do some t-tests later...
Any help?
Thanks a lot in advance guys
The normal distribution, by definition, does not have a min or max. If you go more than a few standard deviations from the mean, the probability density is very small, but not 0. You can truncate a normal distribution, chopping off the tails. Here, I use pmin and pmax to set any values below 0 to 0, and any values above 70 to 70:
ex1 <- round(rnorm(100, mean=48 , sd=5))
ex1 <- pmin(ex1, 70)
ex1 <- pmax(ex1, 0)
You can calculate the probability of an individual observation being below or above a certain point using pnorm. For your mean of 48 and SD of 5, the probability an individual observation is less than 0 is very small:
pnorm(0, mean = 48, sd = 5)
# [1] 3.997221e-22
This probability is so small that the truncation step is unnecessary in most applications. But if you started experimenting with bigger standard deviations, or mean values closer to the bounds, it could become necessary.
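For instance, with a much larger spread the lower bound starts to matter; sd = 15 here is just an illustration:
pnorm(0, mean = 48, sd = 15)
# [1] 0.000687  -- no longer vanishingly small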
This method of truncation is simple, but it is a bit of a hack. If you truncated a distribution to be within 1 SD of the mean using this method, you would end up with spikes at the upper and lower bounds that are even higher than the density at the mean! But it should work well enough for less extreme applications. A more robust method is to draw more samples than you need and keep the first n samples that fall within your bounds (a sketch follows below). If you really care to do things right, there are packages that implement truncated normal distributions.
(The probability of an observation above 70 is larger, since 70 is closer to your mean than 0, but at roughly 5e-6 it is still negligible.)
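A sketch of that resampling idea, as a small helper function (rnorm_bounded is a made-up name; base R only):
# Draw extra values, keep only those inside [lo, hi], repeat until n accumulate
rnorm_bounded <- function(n, mean, sd, lo, hi) {
  out <- numeric(0)
  while (length(out) < n) {
    draws <- rnorm(2 * n, mean, sd)                  # oversample
    out <- c(out, draws[draws >= lo & draws <= hi])  # keep in-bounds values
  }
  out[1:n]
}
ex1 <- round(rnorm_bounded(100, mean = 48, sd = 5, lo = 0, hi = 70))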

Weighted Likelihood of an Event Occurring

I want to identify the probability of certain events occurring within a range.
Min = 600, Max = 50,000, Most frequent outcome = 600
I generated a sequence of events: numbers <- seq(600,50000,by=1)
This is where I get stuck. I am not sure whether I am using the wrong distribution or whether my attempt at execution is going down the wrong path.
qpois(numbers, lambda = 600) produces NaNs
The desired outcome is an output of weighted probabilities (weighted towards the mean of 600), so that I can then assess claims like "the likelihood of an outlier event above 30,000 is 5%", or different cuts like that, by summing the probabilities for those numbers.
A bit rusty, haven't used this for a few years, so any online resources to refresh are also appreciated!
Firstly, I think you're looking for ppois rather than qpois. The function qpois(p, 600) takes a vector p of probabilities. If you do qpois(0.75, 600) you will get 616, meaning that 75% of observations will be at or below 616.
ppois is the inverse of qpois. If you do ppois(616, 600) you will get (approximately) 0.75.
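A quick round trip shows the relationship:
qpois(0.75, lambda = 600)
# [1] 616
ppois(616, lambda = 600)
# ~0.75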
As for your specific distribution, it can't be a Poisson distribution. Let's see what a Poisson distribution with a mean of 600 looks like:
x <- 500:700
plot(x, dpois(x, 600), type = "h")
Getting a value greater than even 900 has essentially zero probability:
1 - ppois(900, 600)
#> [1] 0
So if your data contains values of 30,000 or 50,000 as well as 600, it's certainly not a Poisson distribution.
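A quick sanity check on the scale: the standard deviation of a Poisson distribution is sqrt(lambda), so with lambda = 600 almost all of the mass lies within a couple of hundred of the mean:
sqrt(600)
# [1] 24.49  -- 30,000 is over 1,000 standard deviations above the mean
1 - ppois(30000, 600)
# [1] 0  -- underflows to exactly zero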
Without knowing more about your actual data, it's not really possible to say what distribution you have. Perhaps if you include a sample of it in your question we could be of more help.
EDIT
With the sample of numbers provided in the comments, we can have a look at the actual empirical distribution:
hist(numbers, 200)
and if we want to know the probability at any point, we can create the empirical cumulative distribution function like this:
get_probability_of <- ecdf(numbers)
This allows us to do:
number <- 1:50000
plot(number, get_probability_of(number), ylab = "probability", type = "l")
and
get_probability_of(30000)
#> [1] 0.83588
Which means that the probability of getting a number higher than 30,000 is
1 - get_probability_of(30000)
#> [1] 0.16412
However, in this case, we know how the distribution is generated, so we can calculate the exact theoretical cdf just using some simple geometry (I won't show my working here because although it is simple, it is rather lengthy, dull, and not applicable to other distributions):
cdf <- function(x) ifelse(x < 600, 0, 1 - ((49400 - (x - 600)) / 49400)^2)
and
cdf(30000)
#> [1] 0.8360898
which is very close to, but more theoretically accurate than the empirical value.
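As a visual check, reusing the objects defined above, the theoretical cdf can be overlaid on the empirical one; the two curves should be nearly indistinguishable:
number <- 1:50000
plot(number, get_probability_of(number), ylab = "probability", type = "l")
lines(number, cdf(number), col = "red", lty = 2)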

Calculating a probability that a random number will be between two numbers from a data set

I have generated a random, normally distributed population of data that has a mean of 341.08 and a standard deviation of 3.07. Here's that code:
pop <- rnorm(1000, mean=341.08,sd=3.07)
I need to find out the probability that a random number picked will fall between 337 and 343 (both numbers included). How would I execute this?
This will tabulate that vector using the bounds you set:
table(cut(pop, c(-Inf,337,343,Inf) ))
(-Inf,337] (337,343] (343, Inf]
87 645 268
So the fraction of values (which is also the probability) is:
table(cut(pop, c(-Inf,337,343,Inf) ))[2]/length(pop)
(337,343]
0.645
To make this reproducible you would use set.seed().
And to refine the estimate, if this is being asked as a theoretical question, you might either simulate it with replicate (see the sketch below) or use:
pnorm(343, 341.08, 3.07)-pnorm(337, 341.08, 3.07)
[1] 0.6422225
The first method needs only the data. The other two methods would require knowing that the data came from a Normal distribution with the specified parameters.
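A sketch of the replicate approach mentioned above: repeat the whole simulation many times and average the per-sample fractions to smooth out sampling noise:
set.seed(42)
est <- replicate(1000, {
  x <- rnorm(1000, mean = 341.08, sd = 3.07)
  mean(x >= 337 & x <= 343)   # fraction inside the interval for this sample
})
mean(est)
# ~0.642, converging on the pnorm answer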

R 1000 rnorm samples

x <- rnorm(25) will produce a single sample of size 25 from the standard normal distribution.
How do I take 1000 samples of size 25 from the standard normal distribution at the same time?
I would like to do this efficiently, so that I will be able to compute things such as the mean and standard deviation for each of the 1000 samples and compare them via a histogram.
[Also: I would then like to uniformly and randomly select one of these 1000 samples and bootstrap it.]
X <- matrix(rnorm(25000), 1000, 25)
Each row of X is a sample of size 25 from the standard normal distribution. There are 1000 rows.
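From there, the per-sample statistics and the bootstrap step could look like this (a sketch; rowMeans(X) is a faster equivalent for the means):
sample_means <- apply(X, 1, mean)   # one mean per sample
sample_sds   <- apply(X, 1, sd)     # one sd per sample
hist(sample_means, breaks = 50)
hist(sample_sds, breaks = 50)

# uniformly pick one of the 1000 samples and bootstrap its mean
s <- X[sample(nrow(X), 1), ]
boot_means <- replicate(2000, mean(sample(s, replace = TRUE)))
hist(boot_means, breaks = 50)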

How to find the representative standard deviation from a pool of standard deviation value for a repeated process?

I have a process A that captures 86,400 sample points per day from a system B. I repeated process A for 23 days. After 23 days, I have 23 means and 23 standard deviation (sd) values. I am trying to come up with a normal distribution for this entire process. To construct a normal distribution, I need a representative mean and a representative standard deviation. For the mean, I can take the average of all 23 means, but I am not sure what the representative value for the 23 standard deviations should be.
Is it right to consider average of all standard deviation values as the representative standard deviation for the entire process?
All the 86400 samples points are random numbers between 0 and 20.
It is unclear what you mean by "trying to come up with a normal distribution for this entire process", but I hope this helps:
You have a list of means; to get the representative mean, you are doing the right thing by taking their average. To get the representative standard deviation, take the 23 means that you calculated and find the standard deviation of them as a data set. Below is some R code I hope you can translate to fit your needs.
data <- processA_runFor23Days()
daily_means <- getMeanForEachDay(data) #this should be a vector of length 23
sd(daily_means)
Where "daily_means" are the means for each day. I think this should be ok since each day has the same number of data points.
EDIT:
To be clearer, let's say that you have the means for each of the 23 days:
> daily_means
[1] 0.59073346 0.66107694 0.32187724 0.60259824 0.92803502 0.82414235
[7] 0.21125403 0.61161841 0.48346220 0.86058580 0.87253787 0.94609922
[13] 0.40849556 0.96766218 0.49403126 0.38261995 0.02554012 0.19892710
[19] 0.55517159 0.71836176 0.53599262 0.67525105 0.25059165
Ignore the standard deviations for each of the days; they no longer matter. Your new distribution is now the means from each day. So take the mean and the standard deviation of these 23 numbers.
> mean(daily_means)
[1] 0.5707246
> sd(daily_means)
[1] 0.2624342
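If the goal is then a normal distribution for the entire process, a sketch of the fitted density for the daily means, using these two representative numbers:
curve(dnorm(x, mean = mean(daily_means), sd = sd(daily_means)),
      from = -0.5, to = 1.5, ylab = "density")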
