I'm working on a task where we look at 12 independent and identically distributed random variables, each of which has a standard normal distribution.
From that I understand we have a mean of 0 and sd of 1.
We then have an interval of (-1.644, 1.644)
To find the probability of a single random variable landing in this interval I write:
(pnorm(1.644, mean = 0, sd = 1, lower.tail=TRUE) - pnorm(-1.644, mean = 0, sd = 1, lower.tail=TRUE))
which returns a probability of 0.8998238.
I'm able to find the probability of at least one of the 12 random variables landing outside of the interval (-1.644, 1.644) with the following:
PROB_1 = 1 - (0.8998238^12)
#PROB_1 = 0.7182333
However, how would I find the probability of exactly 2 random variables landing outside the interval? I've attempted the following:
((12*11)/2)*((1-0.7182333)^2)*(0.7182333^10)
I'm sure I'm missing something here, and there is a much easier way to solve this.
Any help is much appreciated.
You need the binomial coefficient. Also note that the per-trial success probability is the chance that a single variable falls outside the interval (1 - prob), not the "at least one" probability you computed.
prob <- pnorm(1.644, mean = 0, sd = 1) - pnorm(-1.644, mean = 0, sd = 1)  # P(a single variable lands inside the interval)
dbinom(2, 12, 1 - prob)  # P(exactly 2 of 12 land outside)
# equivalently, by hand:
prob^10 * (1 - prob)^2 * choose(12, 2)
#> [1] 0.2304877
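If you want a sanity check, a minimal Monte Carlo sketch (base R only; the seed is arbitrary) should land near the same value:
# Simulation cross-check of the exact-2 probability
set.seed(1)                                          # arbitrary seed for reproducibility
outside <- replicate(1e5, sum(abs(rnorm(12)) > 1.644))  # count outside the interval per trial
mean(outside == 2)                                   # should be close to 0.2305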
Just wondering what sd = 1:10 really means in the rnorm function (I've never seen this before): u <- rnorm(mean = 0, sd = 1:10, n = 10). Is there a simple way to check each value of sd using an R command?
If you pass a vector as sd, its values are recycled across the simulations until n draws have been made, so draw i uses the i-th sd (wrapping around if needed). For example:
rnorm(mean = 0, sd = c(0,100000), n = 5)
[1] 0.00 116963.06 0.00 25354.82 0.00
So the first value comes from a normal distribution with mean = 0 and standard deviation = 0, the second value has an sd of 100,000, the third has an sd of 0, and so on.
In your example each value will be simulated from a normal distribution with a different sd, going from 1 to 10.
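If you want to check which sd produced which draw, one small sketch (illustrative only) is to draw each value separately and keep the sd alongside it:
sds <- 1:10
u <- sapply(sds, function(s) rnorm(1, mean = 0, sd = s))  # one draw per sd
data.frame(sd = sds, value = u)                           # pairs each draw with the sd that generated it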
I want to calculate the integral of the normal distribution at exactly some point. I know this can be approximated by evaluating the distribution at that point and at a point slightly after it, then subtracting the two values to get an approximate answer.
I tried doing this in R:
a = pnorm(1.96, mean = 0, sd = 1, log = FALSE)
b = pnorm(1.961, mean = 0, sd = 1, log = FALSE)
final_answer = b - a
#5.83837e-05
Is it possible to do this in one step instead of manually subtracting "a" and "b"?
Thank you!
We need to be clear about what you are asking here. If you are looking for the integral of the normal density up to a specific point, then you can use pnorm, which is the anti-derivative of dnorm.
We can see this by reversing the process and looking at the derivative of pnorm to ensure it matches dnorm:
# Numerical approximation to derivative of pnorm:
delta <- 10^-6
(pnorm(0.75 + delta) - pnorm(0.75)) / delta
#> [1] 0.3011373
Note that this is a very close approximation of dnorm:
dnorm(0.75)
#> [1] 0.3011374
So the integral of the normal density up to a point x (the anti-derivative evaluated at x) is given by:
pnorm(x)
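As a further cross-check (a sketch using base R's integrate()), numerically integrating dnorm up to the point gives the same number as pnorm:
integrate(dnorm, lower = -Inf, upper = 0.75)$value  # numerical integral of the density
pnorm(0.75)                                         # closed-form CDF -- both ~0.7733726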
You can do this in one step with diff():
> diff(pnorm(c(1.96, 1.961), mean = 0, sd = 1, log = FALSE))
[1] 5.83837e-05
It is the case that the probability density for a standardized and unstandardized random variable will differ. E.g., in R
dnorm(x = 0, mean = 1, sd = 2)
dnorm(x = (0 - 1)/2)
However,
pnorm(q = 0, mean = 1, sd = 2)
pnorm(q = (0 - 1)/2)
yields the same value.
Are there any situations in which the normal cumulative distribution function will yield a different probability for the same random variable when it is standardized versus unstandardized? If yes, is there a particular example in which this difference arises? If not, is there a general proof of this property?
Thanks so much for any help and/or insight!
This isn't really a coding question, but I'll answer it anyway.
Short answer: the densities may differ, but the cumulative probabilities do not.
Long answer:
A normal distribution is usually thought of as y=f(x), that is, a curve over the domain of x. When you standardize, you are converting from units of x to units of z. For example, if x~N(15,5^2), then a value of 10 is 5 x-units less than the mean. Notice that this is also 1 standard deviation less than the mean. When you standardize, you convert x to z~N(0,1^2). Now, that example value of 10, when standardized into z-units, becomes a value of -1 (i.e., it's still one standard deviation less than the mean).
As a result, the area under the curve to the left of x=10 is the same as the area under the curve to the left of z=-1. In words, the cumulative probability up to those cut-offs is the same.
However, the heights of the curves are different. Let the normal density curves be f(x) and g(z). Then f(10) != g(-1). In code:
dnorm(10, 15, 5) != dnorm(-1, 0, 1)
The reason is that the act of standardizing either "spreads" or "squishes" the f(x) curve to make it "fit" over the new z domain as g(z).
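To make the relationship concrete (a small sketch): the unstandardized density equals the standardized density divided by the sd, while the cumulative probabilities agree exactly:
# densities differ by the factor 1/sd
dnorm(10, mean = 15, sd = 5)      # f(10)
dnorm(-1, mean = 0, sd = 1) / 5   # g(-1) / sd -- same value (~0.0484)
# cumulative probabilities are identical
pnorm(10, mean = 15, sd = 5)      # ~0.1587
pnorm(-1, mean = 0, sd = 1)       # ~0.1587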
Here are two links that let you visualize the spreading/squishing:
https://academo.org/demos/gaussian-distribution/
https://www.intmath.com/counting-probability/normal-distribution-graph-interactive.php
Hope this helps!
I have the following function:
library(dplyr)  # for between()

samp315 <- function(n = 30, desmean = 86, distance = 3.4995) {
  x <- seq(from = 0, to = 100, by = 0.1)
  samp <- 0
  while (!between(mean(samp), desmean - distance, desmean + distance)) {
    samp <- sample(x, n, replace = TRUE)
  }
  samp
}
percent <- samp315()
So, pretty much, I want to generate 30 numbers within 0-100 that have a mean of 86 +/- 3.4995. However, whenever I run the last line it either loads forever or, when I'm lucky, generates a list of the desired results. Any idea how I could change the function to improve it?
As suggested by Parfait in the comments, you're using a randomization strategy that has a low probability of satisfying the condition you're interested in. Did no other answer to this question help you out?
Some other possible strategies for you to try out.
n = 30
# Using truncated normal
library(truncnorm)
x = round(rtruncnorm(n, a = -0.0495, b = 100.0495, mean = 85, sd = 3.5*2), 1)
# Using beta
sig = 3
x = round(100*rbeta(n, (0.85)*sig, (1-0.85)*sig), 1)
The round(..., 1) is meant to align with your vector x. Both methods will produce very few values far from 85. It's a trade-off you have to consider: if you want a mean in 85 +/- 3.5, then you can't have too many values below 10, for example, so you have to lower the probability of such values being selected. Using your function, when it completes, you'll probably find that values close to 85 are more heavily represented.
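If it helps, here is a rough check (a sketch with an arbitrary seed) of how often the truncated-normal draws land in your target band:
library(truncnorm)
set.seed(123)  # arbitrary seed
means <- replicate(1000, {
  x <- round(rtruncnorm(30, a = -0.0495, b = 100.0495, mean = 85, sd = 3.5 * 2), 1)
  mean(x)
})
mean(abs(means - 85) <= 3.5)  # proportion of sample means inside 85 +/- 3.5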
I ran a path analysis in R, and the following matrix represents the effects between the variables.
M <- c(0, 0, 0, 0, 0)
p<-c(0, 0, 0, 0, 0)
O <- c(0, 0, 0, 0, 0)
T <- c(1, 0, 1, 1, 0)
Sales <- c(1, 1, 1, 1, 0)
sales_path <- rbind(M, p, O, T, Sales)
colnames(sales_path) <- rownames(sales_path)
#innerplot(sales_pls)
sales_blocks <- list(
c("m1", "m2"),
#c("pr"),
c("R1"),
#c("C1"),
c("tt1"),
c("Sales")
)
sales_modes = rep("A", 5)
sales_pls <- plspm(input_file, sales_path, sales_blocks, scheme = "centroid", scaled = FALSE, modes = sales_modes)
I have 2 questions:
Can I use the weights I receive to calculate the value of a latent variable? For example, my M variable has the manifest variables m1 and m2; is there a formula to calculate its value?
The main purpose of running the path analysis is to predict Sales. Is that possible using the estimates (betas) for each latent variable?
I want to know if I am able to calculate the value of the latent variable
Yes, it's actually very easy. Your latent variable scores can be obtained by running sales_pls$scores. You can also run summary(sales_pls). For easier interpretation you may wish to have the latent variables expressed in the same scale as the indicators (manifest variables). This can be accomplished by normalizing the outer weights in each block of indicators so that the weights are expressed as proportions. However, in order to apply this normalization all the outer weights must be positive.
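As a rough illustration of that normalization (the weight and indicator values below are hypothetical, not taken from your output), a latent score is just a weighted sum of the block's indicators with the outer weights rescaled to proportions:
# hypothetical outer weights for the block of M (m1, m2) -- not real results
w <- c(m1 = 0.6, m2 = 0.5)
w_norm <- w / sum(w)              # rescale so the weights sum to 1
obs <- c(m1 = 4, m2 = 5)          # hypothetical manifest values for one case
M_score <- sum(w_norm * obs)      # latent score for M on the indicator scale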
You can apply the function rescale() to the object sales_pls and get scores expressed in the original scale of the manifest variables. Once you get the rescaled scores you can use summary() to verify that the obtained results make sense (i.e. scores expressed in original scale of indicators):
# rescaling scores
rescaled_sales_pls = rescale(sales_pls)
# summary
summary(rescaled_sales_pls)
(again you can also run rescaled_sales_pls)
if I can predict Sales using the betas from the output
Theoretically I guess you could, but I'm not really sure why you would want to. The utility of path analysis here is to decompose the sources of a correlation between an independent variable and a dependent variable of a multiple regression model.