Normal distribution in R (what values for mean and sd?) - r

I have to make a normal distribution from a set of pre-established data, henceforth xvec. I know I need to use dnorm(xvec, meanvec, sdvec), but what values should I use for mean and sd? Can I always use meanvec = mean(xvec) and sdvec = sd(xvec)? Is that a reasonable approach? Or is it preferable to keep the default values of mean = 0 and sd = 1?
I'm asking because in the examples I have looked at, the values for mean and sd were always chosen beforehand. For example, this one, from https://www.tutorialspoint.com/r/r_normal_distribution.htm:
Create a sequence of numbers between -10 and 10 incrementing by 0.1.
x <- seq(-10, 10, by = .1)
Choose the mean as 2.5 and standard deviation as 0.5.
y <- dnorm(x, mean = 2.5, sd = 0.5)
Why did the author choose mean = 2.5 and sd = 0.5, when the sequence itself gives the following?
> mean(x)
[1] 5.105265e-16
> sd(x)
[1] 5.816786

Related

Generate random decimal numbers with given mean in given range in R

I want to generate 100 decimal numbers in the range 10 to 50 with a mean of 32.2.
I can use this to generate the numbers in the wanted range, but I don't get the mean:
runif(100, min=10, max=50)
Or I could use this, but then I don't get the range:
rnorm(100,mean=32.2,sd=10)
How can I combine those two or can I use another function?
I have tried the approach from:
R - random distribution with predefined min, max, mean, and sd values
But I don't get the exact mean I want (31.7 in my example run):
n <- 100
y <- rgbeta(n, mean = 32.2, var = 200, min = 10, max = 50)
Edit: OK, I have lowered the var and the mean gets close to 32.2, but I still want some values near the min and max of the range.
In order to get random numbers between 10 and 50 with a (true) mean of 32.2, you would need a density function that would fulfill those properties.
A uniform distribution with a min of 10 and a max of 50 (runif) will never deliver you that mean, as the true mean is 30 for that distribution.
The normal distribution has a range from -infinity to infinity, regardless of its mean, so rnorm can return numbers greater than 50 and smaller than 10.
If a truncated normal distribution would be okay, you could use one (rnormTrunc is provided by the EnvStats package):
rnormTrunc(n = 100, mean = 32.2, sd = 1, min = 10, max = 50)
If you need a different distribution, things will get a little more complicated.
Edit: feel free to ask if you need the math behind that, but depending on what your density function should look like, it can get very complicated.
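If installing a package is not an option, a truncated normal can also be sketched in base R with the inverse-CDF method. The helper name rtruncnorm_sketch and the choice sd = 10 are assumptions made for illustration; they are not part of the answer above or of any package.

```r
# Hypothetical helper (not from any package): truncated normal via inverse CDF.
rtruncnorm_sketch <- function(n, mean, sd, min, max) {
  # CDF of the untruncated normal at the two bounds
  p_lo <- pnorm(min, mean = mean, sd = sd)
  p_hi <- pnorm(max, mean = mean, sd = sd)
  # Uniform draws restricted to [p_lo, p_hi], mapped back through the quantile function
  u <- runif(n, min = p_lo, max = p_hi)
  qnorm(u, mean = mean, sd = sd)
}

set.seed(1)
x <- rtruncnorm_sketch(100, mean = 32.2, sd = 10, min = 10, max = 50)
range(x)  # every value lies between 10 and 50
mean(x)   # near 32.2, but not exactly: asymmetric truncation shifts the mean
```

Because 32.2 is not the midpoint of [10, 50], more probability mass is cut off on the upper side, so the true mean of the truncated distribution sits slightly below the mean parameter. A smaller sd reduces that shift at the cost of fewer values near the bounds.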
This isn't perfect, but maybe it's a start. I can't get the range to work out perfectly, so I just played with the "max" until I got an output I was happy with. There is probably a more rigorous mathematical way to do this. The result is uniform-adjacent at best.
rand_unif_constrained <- function(num, min, max, mean) {
  vec <- runif(num, min, max)
  # rescale so that the sample mean is exactly `mean`
  vec / sum(vec) * mean * num
}
set.seed(35)
test <- rand_unif_constrained(100, 10, 40, 32.2) # play with max until the max output is less than 50
mean(test)
#> [1] 32.2
min(test)
#> [1] 12.48274
max(test)
#> [1] 48.345
hist(test)

Law of large numbers

Could someone help me, please? Thank you!
This is all I have managed so far. Am I doing it wrong?
rm(list=ls())
a = runif(1000,0,1)
b = pnorm(a, mean = 60.5, sd = 0.1)
mean = rep(1,1000)
for(i in 1:1000){
mean[i] = mean(rexp(b,2))
}
n = seq(1, 1000)
plot(mean ~ n)
The task: generate 1000 numbers from an X ~ U(a, b) distribution.
Then calculate the mean of the first, the first two, the first three, ..., all thousand of these random numbers, and the absolute deviation of those means.
Your mistake here was using the distribution function pnorm instead of the quantile function qnorm. You also call rexp where you should be using the mean function to average the values of your normally distributed vector b.
rm(list = ls())
a <- runif(1000, 0, 1)
b <- qnorm(a, mean = 60.5, sd = 0.1)
avg <- rep(1, 1000)
for (i in 1:1000) {
  avg[i] <- mean(b[1:i])
}
n <- seq(1, 1000)
plot(avg ~ n)
To plot the absolute residual of the running average, subtract avg from 60.5, take the absolute value, and plot that.
residual = abs(60.5 - avg)
plot(residual~n)
I'd also recommend using avg in place of mean as a variable name, since mean is already the name of a function in R.
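As a side note, the running mean computed by the loop can also be written without a loop. The following is a minimal sketch using the same setup as above; the seed and the names avg_loop and avg_vec are chosen here for illustration only.

```r
set.seed(42)
b <- qnorm(runif(1000), mean = 60.5, sd = 0.1)

# Loop-style running mean, as in the answer above
avg_loop <- sapply(seq_along(b), function(i) mean(b[1:i]))

# Vectorised equivalent: cumulative sum divided by the number of terms so far
avg_vec <- cumsum(b) / seq_along(b)

all.equal(avg_loop, avg_vec)        # the two agree up to floating-point error
plot(abs(60.5 - avg_vec), type = "l")  # absolute residual shrinks as n grows
```

The cumsum version does a single pass over the data, whereas the loop recomputes each partial mean from scratch.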

How to Standardize a Column of Data in R and Get a Bell Curve Histogram to Find the Percentage That Falls Within a Range?

I have a data set in which one of the columns contains random numbers ranging from 300 to 400. I'm trying to find what proportion of this column lies between 320 and 350 using R. To my understanding, I need to standardize this data and create a bell curve first. I have the mean and standard deviation, but when I compute (X - mean)/SD and plot a histogram of this column, it's still not a bell curve.
This is the code I tried.
myData$C1 <- (myData$C1 - C1_mean) / C1_SD
If you are simply counting the number of observations in that range, there's no need to do any standardization and you may directly use
mean(myData$C1 >= 320 & myData$C1 <= 350)
As for the standardization, it definitely doesn't create any "bell curves": it only shifts the distribution (centering) and rescales the data (dividing by the standard deviation). Other than that, the shape itself of the density function remains the same.
For instance,
x <- c(rnorm(100, mean = 300, sd = 20), rnorm(100, mean = 400, sd = 20))
mean(x >= 320 & x <= 350)
# [1] 0.065
hist(x)
hist((x - mean(x)) / sd(x))
I suspect that what you are looking for is an estimate of the true, unobserved proportion. The standardization procedure would then be applicable if you had to use tabulated values of the standard normal distribution function. However, in R we can compute the probability directly, without standardizing. In particular,
pnorm(350, mean = mean(x), sd = sd(x)) - pnorm(320, mean = mean(x), sd = sd(x))
# [1] 0.2091931
That's the probability P(320 <= X <= 350), where X is normally distributed with mean mean(x) and standard deviation sd(x). The figure is quite different from that above since we misspecified the underlying distribution by assuming it to be normal; it actually is a mixture of two normal distributions.
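To illustrate why standardization is only needed for table lookups, here is a small sketch showing that the standardized and the direct pnorm calls give the same probability. The simulated data and the names p_direct and p_std are assumptions made for this example.

```r
set.seed(1)
x <- rnorm(200, mean = 350, sd = 20)  # illustrative data that really is normal
m <- mean(x)
s <- sd(x)

# Direct: pass the estimated mean and sd to pnorm
p_direct <- pnorm(350, mean = m, sd = s) - pnorm(320, mean = m, sd = s)

# Table-style: standardize the bounds, then use the standard normal
p_std <- pnorm((350 - m) / s) - pnorm((320 - m) / s)

all.equal(p_direct, p_std)   # same probability either way
mean(x >= 320 & x <= 350)    # empirical proportion, for comparison
```

When the data are actually close to normal, as in this sketch, the model-based probability and the empirical proportion should be of similar size, unlike in the mixture example above.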

Problem with creating a lot of new vectors

I want to do the following:
Draw 50 numbers from a normal distribution with mean = 10 and standard deviation = 20, and repeat this 100 times.
For each draw I want to compute its standard deviation and arithmetic mean.
At the end I want to create a vector of length 100 containing the absolute value of the difference between the standard deviation and the arithmetic mean, i.e. a vector x with x[i] = |a - b|, where a is the standard deviation of the 50 numbers in the i-th draw and b is the mean of the 50 numbers in the i-th draw.
What I did:
Created 100 draws from the normal distribution above:
replicate(100, rnorm(50, 10, 20), simplify = FALSE)
Now I have a problem. I know that I can use the functions mean and sd to compute the arithmetic mean and standard deviation, but I have to define the numbers I drew as vectors. What I mean:
Numbers from the first draw - vector 1
Numbers from the second draw - vector 2
And so on.
Then I can compute their arithmetic mean and standard deviation.
Then I can compute |a - b| (defined above), and at the end create the vector with x[i] = |a - b|.
I have an idea but I don't know how to write it.
This is a matter of assigning the result of replicate to a variable (of class "list", since simplify = FALSE) and then sapply the mean and sd functions.
set.seed(1234) # Make the results reproducible
repl <- replicate(100, rnorm(50, 10, 20), simplify = FALSE)
mu <- sapply(repl, mean)
s <- sapply(repl, sd)
D <- abs(s - mu)
head(D)
#[1] 16.761930 7.953432 6.833691 12.491605 5.490149 6.850794
A one-liner could be
D2 <- sapply(repl, function(x) abs(sd(x) - mean(x)))
identical(D, D2)
#[1] TRUE
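An alternative sketch, assuming it is acceptable to keep the draws in a matrix rather than a list: with the default simplify = TRUE, replicate returns a 50 x 100 matrix with one draw per column. The names draws and D3 are introduced here for illustration.

```r
set.seed(1234)
# One draw of 50 numbers per column, 100 columns
draws <- replicate(100, rnorm(50, 10, 20))
dim(draws)

# Column-wise sd and mean, then the absolute difference
D3 <- abs(apply(draws, 2, sd) - colMeans(draws))
length(D3)
```

With the same seed this produces the same draws as the list version, so D3 should match D up to floating-point error.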

Generate a matrix with certain values such that its standard deviation is 1?

I'm currently going through an 'Introduction to R' book and I am completely stuck at the following question:
Create a 5x5 matrix (M), all its entries drawn from the uniform distribution, with sd 1 and mean being the column number of the element (so mean(M[, i]) == i, sd(M) == 1).
I have to make use of the sapply() function.
I was thinking about something like this:
m <- matrix(runif(25), nrow = 5, ncol = 50
sapply(matrix, function(x) sd(x) == 1)
But that part already doesn't work and I'm just stuck.
Help would be appreciated!
The mean can be set by the following:
my_uniform <- function(col_nbr) {
runif(5, min = col_nbr-sqrt(12)/2, max=col_nbr+sqrt(12)/2)
}
M <- sapply(1:5, my_uniform)
This gives each column a theoretical standard deviation of 1 and a mean equal to its column number. For a uniform distribution on [a, b], the formula for the mean is (a + b) / 2.
The formula for the sd is (b - a) / sqrt(12).
With a uniform distribution you can only simulate values within a range, each with the same probability; as n goes to infinity, the sample mean approaches the midpoint between min and max.
runif takes min and max as arguments, so the mean and the standard deviation cannot be set directly in the function call. What you can do is choose the range so that its midpoint (i.e. the mean) is the number you expect, but with a range of width 1 the standard deviation is 1/sqrt(12), not 1:
set.seed(1)
numrow<-5
numcol<-5
Mat<-matrix(NA, nrow = numrow, ncol = numcol)
for(i in 1:numcol){
Mat[,i]<- runif(numrow, min = i-0.5, max = i+0.5)
}
Mat
# [,1] [,2] [,3] [,4] [,5]
# [1,] 0.7655087 2.398390 2.705975 3.997699 5.434705
# [2,] 0.8721239 2.444675 2.676557 4.217619 4.712143
# [3,] 1.0728534 2.160798 3.187023 4.491906 5.151674
# [4,] 1.4082078 2.129114 2.884104 3.880035 4.625555
# [5,] 0.7016819 1.561786 3.269841 4.277445 4.767221
To see the formulas of the expected mean and expected variance (therefore the standard deviation) I refer to https://en.wikipedia.org/wiki/Uniform_distribution_(continuous)
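This should now be the correct way to set the mean of the uniform distribution. Since the mean is 0.5 * (a + b), fixing the lower limit at 0 and setting the upper limit to twice the column number results in a mean equal to the column number. Note that this only fixes the mean: the standard deviation is then 2x / sqrt(12), which is not 1 in general.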
sapply(1:5, function(x){runif(5, min = 0, max = x*2)})
See this little Monte Carlo experiment:
mean(runif(50000, min = 0, max = 1*2))
You have to find the pdf range (a, b) that fits each (mean, sd) pair first.
The mean of a uniform dist is mu = (b + a) / 2. The mu values run over 1:5.
The sd of a uniform dist is (b - a) / sqrt(12).
The sd is fixed at 1, so use the sd equation to solve for b: b = a + sqrt(12).
Then plug b into the mu equation to solve for a: a = mu - sqrt(12)/2, and hence b = mu + sqrt(12)/2.
Now you have the a, b parameters of the uniform dist.
The sapply function then looks like this:
z <- sapply(1:5, function(x) runif(5, x - sqrt(12)/2, x + sqrt(12)/2))
Running summary(z) will give you the output stats. Because of the small sample size, the sample means will be off. To test, change the runif sample size from 5 to 100000 and run summary(z) again. You will see that the column means converge to the column indices.
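The large-sample check can be sketched as follows, assuming bounds of mu - sqrt(12)/2 and mu + sqrt(12)/2 for column mu, which follow from the mean and sd formulas of the uniform distribution. The names half_width and big are introduced here for illustration.

```r
set.seed(1)
half_width <- sqrt(12) / 2  # (b - a) / sqrt(12) = 1 implies b - a = sqrt(12)

# 100000 draws per column instead of 5, so sample statistics are close to theory
big <- sapply(1:5, function(x) runif(1e5, min = x - half_width, max = x + half_width))

round(colMeans(big), 2)       # should be close to 1, 2, 3, 4, 5
round(apply(big, 2, sd), 2)   # should be close to 1 in every column
```

With only 5 rows per column, as the exercise requires, the sample mean and sd of each column will scatter noticeably around these theoretical values.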

Resources