Calculate standard deviation using R [closed]

Closed. This question needs debugging details. It is not currently accepting answers.
Closed 1 year ago.
Using an R script, solve the following:
An expert on process control states that he is 95% confident that the new production process will save between $26 and $38 per unit, with savings values around $32 more likely.
If you were to model this expert's opinion using a normal distribution (by applying the empirical rule), what standard deviation would you use for your normal distribution? (Round your answer to 1 decimal place.)

It appears that whoever wrote this problem is confused and doesn't know whether they are asking a sample-mean confidence interval problem ("95% confident") or a simple population normal distribution problem.
Let's reason through how we can or can't solve this, and we will discover some problems with the wording of this problem as we go.
He says he is 95% confident that ... (ignore everything else for now and assume this is a confidence interval of the mean problem; we will see why that is wrong). First let's find the z-score that corresponds to a cumulative probability of 0.95. You said you want to do this in R, so use qnorm():
> qnorm(.95)
[1] 1.644854
Now, we know the money saved is between $26 and $38. If his "95% confidence" implies a sample mean, then $26 is 1.644854 standard errors below that sample mean, and $38 is 1.644854 standard errors above it. The sample mean is presumably $32.
Let's say we try to solve for the standard deviation. The standard error is:
StDev / sqrt(sample size)
and the confidence interval is:
lower bound: 32 - 1.644854 * StDev / sqrt(sample size)
upper bound: 32 + 1.644854 * StDev / sqrt(sample size) # we will use this below
We could attempt to solve algebraically for StDev by setting the upper-bound formula equal to the upper bound of 38:
32 + (1.644854 * StDev / sqrt(sample size)) = 38 ... now solve for StDev:
StDev = (sqrt(sample size) * (38 - 32)) / 1.644854
There's a problem here that this rhetorical exercise was meant to point out: We still have 2 variables. The problem you posted simply didn't tell us enough information to solve this with the assumption this was a confidence interval from a sample. We are kind of out of luck if this is where they were going with this.
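To make the dead end concrete, here's a quick sketch (my addition): plugging in a few hypothetical sample sizes gives a different standard deviation for every one of them.
n <- c(5, 10, 30, 100)                       # hypothetical sample sizes
(sqrt(n) * (38 - 32)) / qnorm(0.95)          # a different SD for each n
# roughly 8.16, 11.54, 19.98, 36.48 -- no unique answer without knowing n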
It looks like the 95% confidence clause (and absence of a mention of a sample mean) is meant to throw you off, but in reality, it just makes the person asking you this question appear to be confused as to what question they are asking you.
If you re-frame the question and assume that (1) the "95% confident" clause is junk information, (2) we are talking about the probability that an individual observation falls at or below a given value, not that we are 95% confident about the average observation, and (3) whoever wrote the question misused the phrase "95% confident", was exhausted when they wrote it, or you mis-transcribed the problem ... then the question would be worded like this: "We know that 95% of customers saved no more than $38 and 5% of customers saved $26 or less." In that case we can drop the standard error term altogether and worry only about the standard deviation and mean of the population:
The population mean is then 32.
The mean + 1.644854 standard deviations is 38 (95% of customers save no more than this).
38 - 32 = 6, and that 6 equals 1.644854 standard deviations. Algebraically:
6 = 1.644854 * StDev
Divide both sides by 1.644854:
6 / 1.644854 = StDev
StDev = 3.64774
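In R, that last step is a one-liner (my restatement of the algebra above):
(38 - 32) / qnorm(0.95)
# [1] 3.647741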
Let's verify that we got this right in R: 0.95 cumulative probability, mean of 32, and our asserted StDev of 3.64774:
> qnorm(.95, mean=32, sd=3.64774)
[1] 38
$38 or less is what 95% of customers would save. This seems right.
> qnorm(.05,mean=32,sd=3.64774)
[1] 26
$26 or less is what the 5% of customers who saved the least got. This seems right too.
Summary:
The question you posted doesn't make sense as written. It is either incomplete, mis-transcribed, or whoever wrote it is a little confused.
If you ignore the "95% confidence" clause and reframe the question to compensate for the ambiguity, then the answer is: the standard deviation is 3.6.

According to the empirical rule for the normal distribution:
68% of the data falls within one standard deviation of the mean, 95% within two standard deviations, and 99.7% within three standard deviations.
Since the problem says "95% confident", the data fall within 2 standard deviations of the mean.
So, min_value: 26 = mean - 2 * standard_deviation
or, max_value: 38 = mean + 2 * standard_deviation
Given mean = 32, solving either of the two equations above gives standard_deviation = 3.0.
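A quick check of this reading in R (my addition): with sd = 3, about 95% of the distribution indeed lands between $26 and $38.
(38 - 26) / 4                                  # two SDs on each side of the mean
# [1] 3
pnorm(38, mean = 32, sd = 3) - pnorm(26, mean = 32, sd = 3)
# [1] 0.9544997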

I used this code with lower limit = 2 and upper limit = 3, and it worked correctly for low values of the limits, but it doesn't work for larger numbers unless I add 0.5 to the sd:
f <- function(lwr, upr){
  c("mean" = (upr + lwr) / 2,
    "stddev" = (upr - lwr) / 4,
    "sdRound" = round((upr - lwr) / 4, 1))
}
f(2, 3)
With this, I get:
   mean  stddev sdRound
   2.50    0.25    0.20
I can't use the rounded value in R. The answer I expected is 0.3, since 0.25 conventionally rounds up to 0.3 (see the note on R's rounding after this answer).
When I plug sd = 0.3 into the code below, I get the correct upper limit (and also the correct lower limit):
> upperlimit = round(qnorm(0.95, mean=2.5, sd=0.3), 0)
> lowerlimit = round(qnorm(0.05, mean=2.5, sd=0.3), 0)
upperlimit = 3
lowerlimit = 2
This also works for f(6,9)
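A side note on the rounding surprise above (my addition): R's round() follows the IEC 60559 "round half to even" rule, which is why round(0.25, 1) returns 0.2 rather than 0.3.
round(0.25, 1)             # [1] 0.2 -- the tie goes to the even digit
round(c(0.5, 1.5, 2.5))    # [1] 0 2 2
# If you need conventional half-up rounding, a small (hypothetical) helper:
round_half_up <- function(x, digits = 0) floor(x * 10^digits + 0.5) / 10^digits
round_half_up(0.25, 1)     # [1] 0.3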

Related

Having trouble getting a discrete example of Predict functions confidence interval calculation in R

Hi all, I am looking to figure out how the predict function calculates its confidence intervals. I know there are similar questions, but I have trouble conceptualizing longer equations without a numerical representation (one reason I love coding: I just stink at the math!).
My question is really: how did predict calculate the confidence interval for, say, Money = 3.5 below? The lower and upper bounds have different deltas throughout, so I know the margin of error differs from point to point. I tried looking through several margin-of-error formulas to add to yhat, but could not figure out which standard deviation was actually being used (I used the population st. dev. to no avail).
Lastly, could someone provide some sort of example function for the limits when they are drawn on a graph? There was a very good response at https://stats.stackexchange.com/questions/85560/shape-of-confidence-interval-for-predicted-values-in-linear-regression but I still have no clue how it looks implemented.
Also: if you changed models to logistic regression (glm), would the CI be calculated similarly, or would the assumptions about residual error break it?
Thank you!
Thank you!
# fake example of giving someone money and how much a smile is returned
Money <- c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,5,6,6,6,7,7,7,8,8,8)
Smile <- c(2.618684, 3.004371, 2.226206, 3.218504, 4.206926, 5.361271,
           6.484110, 5.412502, 3.309511, 7.934290, 8.286108, 8.421875,
           9.865312, 10.163182, 9.381625, 12.789413, 12.404028, 12.002910,
           13.805863, 11.978898, 13.448826, 17.288642, 17.105757, 16.648129)
model <- lm(Smile ~ Money)
new.money <- data.frame(Money = c(3, 3.5, 4.6, 5.6))
prediction <- predict(model, new.money, interval = "confidence")  # start the prediction
prediction <- cbind(new.money, prediction)
print(prediction)
The output is:
Money fit lwr upr
1 3.0 6.049064 5.555977 6.542151
2 3.5 7.051695 6.601570 7.501819
3 4.6 9.257482 8.844543 9.670421
4 5.6 11.262743 10.805119 11.720368
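For a plain lm, the interval predict() gives with interval = "confidence" should be the textbook fit +/- t * se(fit), where se(fit) = sqrt(s^2 * x0' (X'X)^-1 x0) and s^2 is the residual variance. Here is a sketch (my addition, not from the thread) that should reproduce the Money = 3.5 row by hand:
X   <- model.matrix(model)                   # design matrix of the fit
x0  <- c(1, 3.5)                             # intercept term + new Money value
fit <- drop(x0 %*% coef(model))              # point prediction
se  <- sqrt(summary(model)$sigma^2 *
            drop(t(x0) %*% solve(crossprod(X)) %*% x0))
tcrit <- qt(0.975, df.residual(model))       # two-sided 95% critical value
c(fit = fit, lwr = fit - tcrit * se, upr = fit + tcrit * se)
As for the glm follow-up: the usual practice is to build the interval the same way on the link scale (via predict(..., se.fit = TRUE)) and back-transform it, since normal residual assumptions don't hold on the response scale.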

How can I create a normal distributed set of data in R?

I'm a newbie in statistics and I'm studying R.
I decided to do this exercise to practice some analysis on an original dataset.
This is the issue: I want to create a dataset of, let's say, 100 subjects, and for each one of them I have a test score.
This test score ranges from 0 to 70 and the mean score is 48 (and it's improbable that someone scores 0).
First I tried to create the set with x <- round(runif(100, min=0, max=70)), but then I found out the values were not normally distributed when I plotted them with plot(x).
So I searched for another R command and found this, but I couldn't set the min/max:
ex1 <- round(rnorm(100, mean=48, sd=5))
I really can't understand what I have to do!
I would like to write a function that gives me a set of normally distributed data, in a range of 0-70, with a mean of 48 and a not-so-big standard deviation, in order to do some t-tests later...
Any help?
Thanks a lot in advance, guys.
The normal distribution, by definition, does not have a min or max. If you go more than a few standard deviations from the mean, the probability density is very small, but not 0. You can truncate a normal distribution, chopping off the tails. Here, I use pmin and pmax to set any values below 0 to 0, and any values above 70 to 70:
ex1 <- round(rnorm(100, mean=48 , sd=5))
ex1 <- pmin(ex1, 70)
ex1 <- pmax(ex1, 0)
You can calculate the probability of an individual observation being below or above a certain point using pnorm. For your mean of 48 and SD of 5, the probability an individual observation is less than 0 is very small:
pnorm(0, mean = 48, sd = 5)
# [1] 3.997221e-22
This probability is so small that the truncation step is unnecessary in most applications. But if you started experimenting with bigger standard deviations, or mean values closer to the bounds, it could become necessary.
This method of truncation is simple, but it is a bit of a hack. If you truncated a distribution to be within 1 SD of the mean using this method, you would end up with spikes at the upper and lower bounds even higher than the density at the mean! But it should work well enough for less extreme applications. A more robust method is to draw more samples than you need and keep the first n samples that fall within your bounds (see the sketch below). If you really care about doing things right, there are packages that implement truncated normal distributions.
(The relevant upper bound here is 70, which is much closer to the mean than 0 is, so the probability of an observation above 70 is larger -- about 5e-06 with mean 48 and SD 5 -- but still negligible.)
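A rough sketch of that oversampling idea with the question's numbers (my addition; rtrunc_norm is a made-up helper, not a standard function):
# Rejection sampling: oversample, keep only draws inside [0, 70], and return
# the first n of them.
rtrunc_norm <- function(n, mean, sd, lwr, upr) {
  out <- numeric(0)
  while (length(out) < n) {
    draws <- rnorm(2 * n, mean, sd)          # oversample
    out <- c(out, draws[draws >= lwr & draws <= upr])
  }
  out[seq_len(n)]
}
ex1 <- round(rtrunc_norm(100, mean = 48, sd = 5, lwr = 0, upr = 70))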

R `t.test` problem cause `159 equal to 8875`

The R version is 3.5.3 (2019-03-11); have a look at the output below:
> t.test(a$score,a$time,paired=FALSE)
Welch Two Sample t-test
data: a$score and a$time
t = -1.4861, df = 8382, p-value = 0.1373
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-20215.279 2781.535
sample estimates:
mean of x mean of y
159.0481 8875.9203
The p-value is 0.1373 (> 0.05), but the means of the two variables are 159.0481 and 8875.9203.
I uploaded the .Rdata to https://file.io/EH9XV44u
Is anything wrong with my t.test?
I think the title of your question shows the problem here.
The result that a mean of 159 in one set of data is not demonstrably different from a mean of 8875 in another set of data doesn't mean '159 equals 8875'.
It just means that the data allow so much uncertainty about the 'true' means (from which the data were drawn) that you can't say with confidence that they are different.
Even though, intuitively, 159 'looks' very different from 8875, we perform a statistical test to verify (or refute) our intuition that this difference couldn't have arisen by chance. In this case, it seems that intuition is wrong.
As Edward and Hong Ooi have said in the comments, this is probably because one (or both) of your datasets is very dispersed, so the mean alone doesn't reflect the amount of uncertainty.
An extreme example, which might make this clearer:
data1 <- c(7, 105, 365)    # mean 159
data2 <- c(3, 22, 26600)   # mean 8875
It's clear (to me) that we can't be very confident that data1 and data2 are really different, since the difference in means arises from just a single high value in data2. So, although the means seem very different, we don't expect the difference to be significant if we test it.
Indeed:
t.test(data1, data2)
# p-value = 0.4291
I guess that if you look closely at your own data you'll find something similar...
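If you want to see this in your own data, one quick check (my addition; it assumes your data frame is loaded as a):
# Compare the spread of the two columns: an SD that is huge relative to the
# difference in means is what produces the large p-value.
sapply(a[c("score", "time")], function(x) c(mean = mean(x), sd = sd(x)))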

Find the standard deviation with normal probabilities using R [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about programming within the scope defined in the help center.
Closed 7 years ago.
Let's say there is a variable A that has a normal distribution N(μ, σ).
I have two probabilities, P(A > a) and P(A < b), where a < b, each given as a percentage.
With this information, can R find the standard deviation? I don't know which commands to use: qnorm, dnorm, ...
What I tried to do, knowing that a = 100, b = 200, P(A > a) = 5% and P(A < b) = 15%:
Use the standardized normal distribution with μ = 0, σ = 1 (but I don't know how to set this up in R to get what I want).
Look up the probability in the normal distribution table and calculate z ..., but it didn't work.
Is there a way R can find the standard deviation with just this information?
Your problem as stated is impossible; check that your inequalities and values are correct.
You give the example that P(A > 100) = 5%, which means P(A < 100) = 95%, which in turn means P(A < 200) must be greater than 95% (all the probability between 100 and 200 adds to that 95%). But you also say P(A < 200) = 15%. There is no set of numbers that gives a probability that is both greater than 95% and equal to 15%.
Once you fix the problem definition to something consistent, there are a couple of options. Using Ryacas you may be able to solve directly (2 equations and 2 unknowns), but since this is based on the integral of the normal density, I don't know whether it would work.
Another option is to use optim or a similar routine to find an approximate solution. Create an objective function that takes 2 parameters, the mean and sd of the normal, and computes the sum of squared differences between the stated percentages and those computed from the current guesses. The objective function will be 0 at the "correct" mean and standard deviation and positive everywhere else. Then pass this function to optim to find the minimum.
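A rough sketch of that optim approach (my addition), with the example made consistent by changing P(A < 200) to 99%:
# Minimise the squared error between the stated and implied probabilities.
# Optimising over log(sigma) keeps sigma positive.
obj <- function(par) {
  mu <- par[1]; sigma <- exp(par[2])
  (pnorm(100, mu, sigma, lower.tail = FALSE) - 0.05)^2 +
    (pnorm(200, mu, sigma) - 0.99)^2
}
fit <- optim(c(100, log(50)), obj)           # start between the two cut points
c(mean = fit$par[1], sd = exp(fit$par[2]))
# Closed-form check: both statements are just z-score equations, so
# sd = (200 - 100) / (qnorm(0.99) - qnorm(0.95)) and mean = 100 - qnorm(0.95) * sd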

calculating the confidence intervals between two approaches

I'm trying to do something like what is described in the "VSUMM" article.
There are 5 approaches whose results I want to compare with mine. I compare all of them in a table by calculating CUS(a), the mean accuracy rate, and CUS(e), the mean error rate, for each of them. (It's a video summarization problem: each approach produces summaries for 50 videos, these summaries are compared with five user summaries, and CUS(a) and CUS(e) are calculated from those results.)
Now I should calculate the confidence intervals (at a confidence level of 98%) comparing each of those approaches with mine.
I surfed the web and found long articles about confidence intervals, but I did not really understand any of them.
What should I do?
I finally found the answer:
To calculate the confidence intervals between your approach and the other approaches, you need the results of your algorithm and of those approaches. Then you can calculate the mean and standard deviation of these data and use the confidence interval formula for a difference of means:
(amean - bmean) +/- z(1 - a/2) * sqrt((stda^2 / n1) + (stdb^2 / n2))
where:
stda is the standard deviation of group a,
n1 is the number of data points in the first group, and
amean is the mean of the first group, i.e. the mean of the results of the first approach.
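A minimal sketch of that formula in R (my addition; the ci_diff helper and the numbers are made up for illustration):
# 98% confidence interval for the difference between two approaches' means.
ci_diff <- function(a, b, conf = 0.98) {
  z  <- qnorm(1 - (1 - conf) / 2)            # z_(1 - alpha/2); ~2.326 for 98%
  d  <- mean(a) - mean(b)
  se <- sqrt(sd(a)^2 / length(a) + sd(b)^2 / length(b))
  c(lower = d - z * se, upper = d + z * se)
}
a <- c(0.85, 0.82, 0.90, 0.78, 0.88)   # e.g. per-video CUS(a) for your approach
b <- c(0.80, 0.75, 0.84, 0.79, 0.81)   # per-video CUS(a) for one baseline
ci_diff(a, b)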
