Find the standard deviation with normal probabilities using R [closed] - r

Let's say there is a variable A that follows a Normal distribution N(μ, σ).
I know two probabilities, P(A > a) and P(A < b), where a < b, each given as a percentage.
With this information, can R find the standard deviation? I don't know which commands to use (qnorm, dnorm, ...) to get the standard deviation.
What I tried to do, knowing that a = 100, b = 200, P(A > a) = 5% and P(A < b) = 15%:
Use the standardized Normal distribution with μ = 0, σ = 1 (but I don't know how to express this in R to get what I want).
Look up the probability in the normal distribution table and calculate Z, but it didn't work.
Is there a way R can find the standard deviation with just this information?

Your problem as stated is impossible; check that your inequalities and values are correct.
You give the example that P(A > 100) = 5%, which means that P(A < 100) = 95%, which in turn means that P(A < 200) must be greater than 95% (all the probability between 100 and 200 only adds to that 95%). But you also say that P(A < 200) = 15%. There is no set of parameters that can give a probability that is both greater than 95% and equal to 15%.
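You can see the contradiction directly in R; the parameters below are arbitrary, and any choice gives the same ordering:
mu <- 150; sigma <- 40        # arbitrary example parameters
pnorm(100, mu, sigma)         # P(A < 100)
pnorm(200, mu, sigma)         # P(A < 200), always at least as large as P(A < 100)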
Once you fix the problem definition to something that works, there are a couple of options. Using Ryacas you may be able to solve directly (2 equations and 2 unknowns), but since this is based on the integral of the normal density I don't know whether it would work or not.
Another option is to use optim or a similar function to find an approximate solution. Create an objective function that takes 2 parameters, the mean and sd of the normal, and computes the sum of the squared differences between the stated percentages and those computed from the current guesses. The objective function will be 0 at the "correct" mean and standard deviation and positive everywhere else. Then pass this function to optim to find the minimum.
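A minimal sketch of that optim approach, assuming a consistent pair of probabilities (say P(A < 100) = 5% and P(A < 200) = 85%; these values are illustrative, not the ones from the question):
obj <- function(par) {
  mu <- par[1]; sigma <- par[2]
  if (sigma <= 0) return(Inf)             # keep the sd positive
  (pnorm(100, mu, sigma) - 0.05)^2 +      # squared gap at the first stated probability
    (pnorm(200, mu, sigma) - 0.85)^2      # squared gap at the second
}
fit <- optim(c(150, 50), obj)             # starting guesses for mean and sd
fit$par                                   # roughly mu = 161, sigma = 37
For this consistent version the solution can also be read off directly, since each condition is linear in mu and sigma: sigma = 100 / (qnorm(0.85) - qnorm(0.05)) and mu = 100 - qnorm(0.05) * sigma, which is what optim converges to.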

Related

Calculate standard deviation using R [closed]

Using an R script, solve the following:
An expert on process control states that he is 95% confident that the new production process will save between $26 and $38 per unit, with savings values around $32 more likely.
If you were to model this expert's opinion using a normal distribution (by applying the empirical rule), what standard deviation would you use for your normal distribution? (Round your answer to 1 decimal place.)
It appears that whoever wrote this problem is confused and doesn't know whether they are asking a sample-mean confidence interval problem ("95% confident") or a simple population normal distribution problem.
Let's try to reason through how we can or can't solve this, and we will discover some problems with the wording of the question as we go.
He says he is 95% confident that ... [ignore everything else for now and assume that this is a confidence-interval-of-the-mean problem ... we will see why that is wrong] ... First let's calculate the z score which corresponds to a 0.95 cumulative probability. You said you want to do this in R, so use qnorm():
> qnorm(.95)
[1] 1.644854
Now, we know that the money saved is between $26 and $38. If his "95% confidence" implies this is a sample-mean confidence interval, then $26 is 1.644854 standard errors below the sample mean and $38 is 1.644854 standard errors above it. Their sample mean is presumably $32.
Let's say we try to solve for the StDev. The standard error is:
StDev / sqrt(sample size)
and the confidence interval is:
lower bound: 32 - 1.644854 * StDev / sqrt(sample size);
upper bound: 32 + 1.644854 * StDev / sqrt(sample size) # we will use this below
We could attempt to solve algebraically for StDev by putting the upper-bound formula on the left side of the equals sign and 38, the upper bound, on the right side:
32 + (1.644854 * StDev / sqrt(sample size)) = 38 ... now solve for StDev
StDev = (sqrt(sample size) * (38 - 32)) / 1.644854 ... if I didn't screw up my mental algebra at midnight without paper in hand...
There's a problem here that this rhetorical exercise was meant to point out: we still have two unknowns. The problem you posted simply doesn't give enough information to solve it under the assumption that this is a confidence interval from a sample. We are out of luck if that is where the question was going.
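To make the under-determination concrete, here is the algebra above evaluated for a few hypothetical sample sizes (the sizes are made up); every n implies a different StDev:
n <- c(5, 10, 30, 100)                # hypothetical sample sizes
(sqrt(n) * (38 - 32)) / qnorm(0.95)   # roughly 8.2, 11.5, 20.0, 36.5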
It looks like the 95% confidence clause (and the absence of any mention of a sample mean) is meant to throw you off, but in reality it just makes the person asking the question appear confused about what they are asking.
If you re-frame the question and assume that (1) the "95% confident" clause is junk information, (2) we are talking about the probability that an individual observation falls at or below a given value, not that we are 95% confident about the average observation, and (3) whoever wrote the question either misused the phrase "95% confident" or was exhausted when they wrote it, or you mis-transcribed the problem, then the question would be worded like this: "We know that 95% of all customers saved no more than $38 and 5% of customers saved $26 or less." In that case we can drop the standard error term altogether and worry only about the standard deviation and mean of the population:
The population mean then is 32
The mean + 1.644854 standard deviations is 38 (95% of customers save no more than this).
38 - 32 = 6, which equals 1.644854 standard deviations. Algebraically that's written:
6 = 1.644854 * stdev
divide both sides by 1.644854:
6/1.644854 = StDev
StDev = 3.64774
Let's verify that we got this right:
> qnorm(.95, mean=32, sd=3.64774)  # verify in R: .95 cumulative probability, mean 32, and the StDev we calculated, 3.64774
[1] 38
$38 or less is what 95% would get. This seems right.
> qnorm(.05,mean=32,sd=3.64774)
[1] 26
$26 or less is what the 5% of customers who saved the least got. This also seems right.
Summary:
The question you posted doesn't make sense as written. It is either incomplete, mis-transcribed, or written by someone who was a little confused.
If you ignore the 95% confidence clause and reframe the question to compensate for its ambiguity, then the answer is: the standard deviation is 3.6.
According to the empirical rule of the normal distribution:
68% of the data falls within one standard deviation, 95% within two standard deviations, and 99.7% within three standard deviations of the mean.
Since the question says 95% confident, the data fall within 2 standard deviations.
So, min_value: 26 = mean - 2 * standard_deviation
or, max_value: 38 = mean + 2 * standard_deviation
Given mean = 32, solving either of the two equations above gives standard_deviation = 3.0.
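A quick arithmetic check of that answer in R (mean 32, sd 3), including the coverage implied by plus or minus 2 standard deviations:
32 - 2 * 3                                                    # 26, the stated minimum
32 + 2 * 3                                                    # 38, the stated maximum
pnorm(38, mean = 32, sd = 3) - pnorm(26, mean = 32, sd = 3)   # about 0.954, i.e. roughly 95%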
I used this code for lower limit = 2 and upper limit = 3, and it worked correctly for small values of the limits, but it doesn't work for larger numbers unless I add 0.5 to the sd.
> f <- function(lwr, upr){
+   c("mean"    = (upr + lwr) / 2,
+     "stddev"  = (upr - lwr) / 4,
+     "sdRound" = round((upr - lwr) / 4, 1)) }
> f(2,3)
With this, I get the answers as:
mean stddev sdRound
2.50 0.25 0.20
I can't use the value as rounded in R: round() gives 0.2 (as shown above), but the expected answer is 0.3, i.e. 0.25 rounded up.
When I plug sd = 0.3 into the code below, I get the correct upper limit (and also the correct lower limit):
> upperlimit = round(qnorm(0.95, mean=2.5, sd=0.3),0)
> lowerlimit = round(qnorm(0.05, mean=2.5, sd=0.3))
upperlimit = 3
lowerlimit = 2
This also works for f(6, 9), as the check below shows.
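For the record, the same check for f(6, 9), with the values computed in the same way as above:
f(6, 9)
#    mean  stddev sdRound
#    7.50    0.75    0.80
round(qnorm(0.95, mean = 7.5, sd = 0.8), 0)   # 9
round(qnorm(0.05, mean = 7.5, sd = 0.8), 0)   # 6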

Calculating confidence intervals for quantiles, first by hand (then in R) [closed]

It would be great if someone could check whether my approach is correct or not.
The question, in short, is whether the error calculation is done the correct way.
Let's assume I have the following data:
data = c(23.7,25.47,25.16,23.08,24.86,27.89,25.9,25.08,25.08,24.16,20.89)
Furthermore, I want to check whether my data follow a normal distribution.
Edit: I know that there are tests etc., but I will concentrate on constructing the QQ-plot with confidence lines. I know that there is a method in the car package, but I want to understand how these lines are built.
So I calculate the percentiles for my sample data as well as for my theoretical distribution (with estimated mu = 24.6609 and sigma = 1.6828), and I end up with these two vectors containing the percentiles:
percentileReal = c(23.08,23.7,24.16,24.86,25.08,25.08,25.16,25.47,25.90)
percentileTheo = c(22.50,23.24,23.78,24.23,24.66,25.09,25.54,26.08,26.82)
Now I want to calculate the confidence interval for alpha = 0.05 for the theoretical percentiles. If I remember correctly, the formula is given by
error = z * sigma / sqrt(n),
value = +- error
with n = length(data) and z = the quantile of the normal distribution for the given p.
So in order to get the confidence interval for the 2nd percentile in my vector (p = 0.20) I do the following:
error = (qnorm(0.20 + alpha/2, mu, sigma) - qnorm(0.20 - alpha/2, mu, sigma)) * sigma / sqrt(n)
Insert the values:
error = (qnorm(0.225, 24.6609, 1.6828) - qnorm(0.175, 24.6609, 1.6828)) * 1.6828 / sqrt(11)
error = 0.152985
confidenceInterval (for 2nd percentile) = [23.24 - 0.152985, 23.24 + 0.152985]
confidenceInterval (for 2nd percentile) = [23.0870, 23.3929]
Finally I have
percentileTheoLower = c(..., 23.0870, ...)
percentileTheoUpper = c(..., 23.3929, ...)
and the same for the rest.
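(For reference, the same arithmetic transcribed into R, using the estimates quoted above:)
mu <- 24.6609; sigma <- 1.6828; n <- 11; alpha <- 0.05
error <- (qnorm(0.20 + alpha/2, mu, sigma) -
          qnorm(0.20 - alpha/2, mu, sigma)) * sigma / sqrt(n)
error                              # 0.153, matching the hand calculation
c(23.24 - error, 23.24 + error)    # roughly [23.087, 23.393]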
So what do you think, can I go with it?
If your goal is to test whether the data follow a normal distribution, use the Shapiro-Wilk test:
shapiro.test(data)
# Shapiro-Wilk normality test
# data: data
# W = 0.9409, p-value = 0.5306
The null hypothesis of the test is that the data are normally distributed. Since p > 0.05, we cannot reject that hypothesis, i.e. the test gives no evidence that the distribution is non-normal. (Strictly speaking, the p-value is not the probability that the distribution is normal.)
You can also use qqnorm(...). The more nearly linear this plot is, the more likely it is that your data are normally distributed.
qqnorm(data)
Finally, there is the nortest package in R, which has, among other things, the Pearson chi-square test for normality:
library(nortest)
pearson.test(data)
# Pearson chi-square normality test
# data: data
# P = 3.7273, p-value = 0.2925
This (more conservative) test also gives no reason to reject normality at the 5% level (p-value = 0.2925). All these tests are fully explained in the documentation.
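For the confidence lines the question actually asked about, here is a hedged sketch of pointwise bands around the theoretical percentiles. It uses the order-statistic standard error SE(p) = (sigma / dnorm(z_p)) * sqrt(p * (1 - p) / n); that formula is my assumption for illustration, not something taken from the answer above:
data  <- c(23.7, 25.47, 25.16, 23.08, 24.86, 27.89, 25.9, 25.08, 25.08, 24.16, 20.89)
n     <- length(data)
mu    <- mean(data)                       # 24.6609, as in the question
sigma <- sqrt(mean((data - mu)^2))        # 1.6828, the estimate used in the question
p     <- (1:9) / 10                       # the nine percentiles used above
z     <- qnorm(p)
theo  <- mu + sigma * z                   # theoretical percentiles
se    <- (sigma / dnorm(z)) * sqrt(p * (1 - p) / n)
cbind(p, lower = theo - qnorm(0.975) * se, upper = theo + qnorm(0.975) * se)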

Determining slopes with R code [closed]

I have a number of melting curves for which I want to determine the slope of the steepest part between the minimum (valley) and the maximum (peak) using R code (the slope at the inflection point corresponds to the melting point). The solutions I can imagine are either to determine the slope at every point and then find the maximum positive value, or to fit a 4-parameter Weibull-type curve using the drc package to determine the inflection point (basically corresponding to the 50% response point between minimum and maximum). In the latter case the tricky part is that this fitting has to be restricted, for each curve, to the temperature range between the minimum (valley) and the maximum (peak) fluorescence response. These temperature ranges differ for each curve.
Grateful for any feedback!
The diff function accomplishes the equivalent of numerical differentiation on equally spaced values (up to a constant factor), so finding the maximum (or minimum) of the differences identifies the location of steepest ascent (or descent):
z <- exp(-seq(0,3, by=0.1)^2 )
plot(z)
plot(diff(z))
z[ which(abs(diff(z))==max(abs(diff(z))) )]
# [1] 0.6126264
# could have also tested for min() instead of max(abs())
plot(z)
abline( v = which(abs(diff(z))==max(abs(diff(z))) ) )
abline( h = z[which(abs(diff(z))==max(abs(diff(z))) ) ] )
With an x-spacing of 1, the slope is just the difference at that point (here the points are 0.1 apart, so the actual slope is this value divided by 0.1):
diff(z) [ which(abs(diff(z))==max(abs(diff(z))) ) ]
[1] -0.08533397
... but I question whether that is really of much interest. I would have thought that getting the index (which would be the melting point subject to an offset) would be the value of interest.
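To handle the restriction mentioned in the question (searching only between the valley and the peak of each curve), here is a hedged sketch along the same lines; temp and fluor are hypothetical vectors standing in for a single melting curve:
set.seed(42)
temp  <- seq(20, 95, by = 0.5)                        # temperatures
fluor <- 1 / (1 + exp(-(temp - 60) / 2)) +            # sigmoidal melt curve
           rnorm(length(temp), 0, 0.005)              # with a little noise
lo  <- which.min(fluor)                               # index of the valley
hi  <- which.max(fluor)                               # index of the peak
idx <- seq(min(lo, hi), max(lo, hi))                  # keep only that stretch
slopes <- diff(fluor[idx]) / diff(temp[idx])          # numerical derivative
temp[idx][which.max(slopes)]                          # approximate melting point, close to 60 here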

Adjusting regression weight based on feedback [closed]

Let's say I want to predict a dependent variable D, where:
D<-rnorm(100)
I cannot observe D, but I know the values of three predictor variables:
I1<-D+rnorm(100,0,10)
I2<-D+rnorm(100,0,30)
I3<-D+rnorm(100,0,50)
I want to predict D by using the following regression equation:
I1 * w1 + I2 * w2 + I3 * w3 = ~D
However, I do not know the correct values of the weights (w); I would like to fine-tune them by repeatedly updating my estimate:
In the first step I use equal weights:
w1 = .33, w2 = .33, w3 = .33
and I estimate D using these weights:
EST = I1 * .33 + I2 * .33 + I3 * .33
I receive feedback, which is a difference score between D and my estimate (diff = D - EST).
I use this feedback to modify my original weights and fine-tune them to eventually minimize the difference between D and EST.
My question is:
Is the difference score sufficient for being able to fine-tune the weights?
What are some ways of manually fine-tuning the weights? (E.g. can I look at the correlation between diff and I1, I2, I3 and use that as a weight?)
The following command,
coefficients(lm(D ~ I1 + I2 + I3))
will give you the ideal weights to minimize diff.
Your defined diff will not tell you enough to manually manipulate the weights correctly as there is no way to isolate the error component of each I.
The correlation between D and the I's is not sufficient either, as it only tells you the strength of each predictor, not its weight. If the noise components of your I's are truly independent of each other (a strong assumption, but true here since each uses its own rnorm draw), you could try adjusting one weight at a time and watching how diff changes, but fitting a linear regression model is the simplest way to do it.
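If you do want to fine-tune the weights manually from the feedback, here is a hedged sketch of a simple gradient-descent-style update; the step size eta and the iteration count are arbitrary choices for illustration, not part of the answer above:
set.seed(1)
D  <- rnorm(100)
I1 <- D + rnorm(100, 0, 10)
I2 <- D + rnorm(100, 0, 30)
I3 <- D + rnorm(100, 0, 50)
X   <- cbind(I1, I2, I3)
w   <- c(1, 1, 1) / 3            # start with equal weights
eta <- 1e-4                      # small step size (assumed)
for (i in 1:5000) {
  EST  <- X %*% w
  diff <- D - EST                                      # the feedback described in the question
  w    <- w + eta * crossprod(X, diff) / length(D)     # nudge weights along the gradient
}
drop(w)                          # compare with coefficients(lm(D ~ I1 + I2 + I3))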

Estimating the rate of convergence, finite difference scheme [closed]

I'm trying to estimate the rate of convergence of a sequence.
Background:
u^(n+1) = G u^n, where G is an iteration matrix (coming from the heat equation).
Fixing dx = 0.1 and setting dt = dx*dx/2.0 to satisfy the stability constraint,
I then do a number of iterations up to time T = 0.1 and calculate the error (the analytical solution is known) using the max-norm.
This gives me a sequence of global errors, which from the theory should be of the form O(dt) + O(dx^2).
Now, I want to confirm that we have O(dt).
How should I do this?
Relaunch the same code with dt/2 and witness the error being halved.
I think Alexandre C.'s suggestion might need a little refinement (no pun intended) because the global error estimate depends on both Δt and Δx.
So if Δx were too coarse, refining Δt by halving might not produce the expected reduction of halving the error.
A better test might then be to simultaneously reduce Δt by quartering and Δx by halving. Then the global error estimate leads us to expect the error reduced by quartering.
Incidentally, it is common to plot the global error against the step size on a log-log graph to estimate the order of convergence.
With greater resources (of time and computer runs), independently varying the time and space discretizations would allow a two-parameter fit (of the same log-log kind).
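A hedged sketch of that log-log fit in R, assuming the errors below were collected from runs at successively halved time steps (the numbers are purely illustrative):
dts  <- c(5e-3, 2.5e-3, 1.25e-3, 6.25e-4)    # hypothetical time steps
errs <- c(2.1e-3, 1.05e-3, 5.3e-4, 2.6e-4)   # hypothetical max-norm errors
fit  <- lm(log(errs) ~ log(dts))
coef(fit)[2]                                 # slope; a value close to 1 confirms O(dt)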
I suck at physics, but simple problems like this even I can do.
Well, what exactly is the problem you are having?
Calculating the rate of convergence:
If you have a sequence a[n], first find the value it converges to, L = Limit[a[n], n -> Infinity].
Then the rate of convergence is μ = Limit[(a[n + 1] - L)/(a[n] - L), n -> Infinity].
Finding the combined uncertainty with the analytical solution, using the equation:
Uc = Sqrt[(D[a, t] Δt)^2 + (D[a, x] Δx)^2]
