Calculating the probability of a sample mean using R

I'm in an intro to stats class right now, and have absolutely no idea what's going on. How would I solve the following problem using R?
Let x be a continuous random variable that has a normal distribution with a mean of 71 and a standard deviation of 15. Assuming n/N is less than or equal to 0.05, find the probability that the sample mean, x-bar, for a random sample of 24 taken from this population will be between 68.1 and 78.3.
I'm really struggling on this one and I still have to get through other problems in the same format. Any help would be greatly appreciated!

For R coding this might set you up:
# Children's IQ scores are normally distributed with a
# mean of 100 and a standard deviation of 15. What
# proportion of children are expected to have an IQ between
# 80 and 120?
mean=100; sd=15
lb=80; ub=120

x <- seq(-4, 4, length=100)*sd + mean
hx <- dnorm(x, mean, sd)

plot(x, hx, type="n", xlab="IQ Values", ylab="",
     main="Normal Distribution", axes=FALSE)

i <- x >= lb & x <= ub
lines(x, hx)
polygon(c(lb, x[i], ub), c(0, hx[i], 0), col="red")

area <- pnorm(ub, mean, sd) - pnorm(lb, mean, sd)
result <- paste("P(", lb, "< IQ <", ub, ") =",
                signif(area, digits=3))
mtext(result, 3)
axis(1, at=seq(40, 160, 20), pos=0)
There is also a nice introductory course on R and data analysis by DataCamp, which might come in handy:
https://www.datacamp.com/courses/exploratory-data-analysis
And another tutorial on R and statistics:
http://www.cyclismo.org/tutorial/R/confidence.html

In terms of the code:
se <- 15/sqrt(24)                           # standard error of the sample mean; the population sd is given as 15
pnorm(78.3, 71, se) - pnorm(68.1, 71, se)   # ~0.82
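If you want to convince yourself of that number, a quick Monte Carlo check (my sketch, not part of the original answer) simulates many samples of size 24 and counts how often the sample mean lands in the interval:
set.seed(1)
xbar <- replicate(100000, mean(rnorm(24, 71, 15)))
mean(xbar >= 68.1 & xbar <= 78.3)   # close to the pnorm answer, about 0.82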
In terms of stats... you should probably refer to stats.stackexchange.com or your professor.

Related

Why is my bootstrapping technique not accurate for my non-normal distribution?

My code:
set.seed(1234)
population1 <- rpois(1000000, 0.6)
sample1 <- sample(population1, 30, replace=F)
mean(sample1)
sample_bs <- replicate(200, mean(sample(sample1, 30, replace=T)))
mean(sample_bs)
gmodels::ci(sample_bs)
My results:
population mean: 0.6
sample mean: 0.4666
results from my bootstrapping procedure:
Estimate CI lower CI upper Std. Error
0.467500000 0.450542365 0.484457635 0.008599396
So my question is: why are the results from this procedure still so far from the original population mean?
You are not sampling from your original population, you are sampling from your first sample! So the mean will be close to the mean of sample1, not population1.
Try this instead, and you will see how both results are close:
set.seed(1234)
population1 <- rpois(1000000, 0.6)
sample1 <- sample(population1, 30, replace=F)
mean(sample1)
sample_bs <- replicate(200, mean(sample(population1, 30, replace=T)))
mean(sample_bs)
gmodels::ci(sample_bs)
Your main issue is that the sample size is too small. We can make some educated guesses at the precision you can expect from a sample size of 30, and what sample size you'd need for a given expected precision.
We will assume a normal distribution, but for this to be reasonable, the distribution of what you want to measure needs to be reasonably normal. That is, the distribution of the sample means needs to be reasonably normal; the distribution within any given sample does not.
So…
set.seed(1234)
population1 <- rpois(1000000, 0.6)
sm30 <- scale(replicate(10000, mean(sample(population1, 30, rep=FALSE))))
s <- seq(-3, 3, by=0.01)
plot(s, dnorm(s, 0, 1), type="l", col="grey80", lwd=9)
lines(density(sm30), col=3, lwd=2)
That looks fairly normal-ish to me. Some skew, but close enough for government work, as they say (CERN might demand better).
Assuming this is normal enough, we can continue estimating (this is Stat101 stuff):
z <- qnorm(1-(0.05/2))
0.6 + z*c(-1, 1)*sqrt(0.6/30)
# 0.32281924 0.87718076
With a 95% confidence level and a sample size of 30, we get some pretty wide confidence intervals (CI), as you experienced.
By rearranging the expression a bit, we can figure out what sample size we need for a given precision at a given confidence level. If we keep the 95% level and say we want the CI to be 0.6±0.1, we get:
(n <- ceiling((z*sqrt(0.6)/0.1)^2))
# 231
0.6 + z*c(-1, 1)*sqrt(0.6/n)
# 0.50011099 0.69988901
That's a bit more than 30.
While bootstrapping can help you estimate confidence intervals for non-normal distributions, it can't make up for missing information. The amount of information is inherent in the variance of your data and the size of your sample.
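To make that concrete, here is a small sketch (my illustration; boot_ci is just a throwaway helper, not part of the original answer) that bootstraps the mean from samples of size 30 and 231 drawn from the same Poisson population. The bootstrap CI is centred on whatever sample it was handed, so only the larger sample tends to pin down the population mean of 0.6 with the desired precision:
set.seed(1234)
population1 <- rpois(1000000, 0.6)

boot_ci <- function(n, B = 2000) {
  s <- sample(population1, n, replace = FALSE)            # one observed sample of size n
  bs <- replicate(B, mean(sample(s, n, replace = TRUE)))  # bootstrap distribution of the mean
  quantile(bs, c(0.025, 0.975))                           # percentile bootstrap CI
}

boot_ci(30)    # wide, and centred on whatever the n = 30 sample happened to give
boot_ci(231)   # narrower, and typically much closer to 0.6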

How to Standardize a Column of Data in R and Get a Bell Curve Histogram to Find the Percentage That Falls Within a Range?

I have a data set and one of the columns contains random numbers ranging from 300 to 400. I'm trying to find what proportion of this column is between 320 and 350 using R. To my understanding, I need to standardize this data and create a bell curve first. I have the mean and standard deviation, but when I do (X - mean)/SD and get a histogram of this column, it's still not a bell curve.
This is the code I tried:
myData$C1 <- (myData$C1 - C1_mean) / C1_SD
If you are simply counting the number of observations in that range, there's no need to do any standardization and you may directly use
mean(myData$C1 >= 320 & myData$C1 <= 350)
As for the standardization, it definitely doesn't create any "bell curves": it only shifts the distribution (centering) and rescales the data (dividing by the standard deviation). Other than that, the shape of the density function itself remains the same.
For instance,
x <- c(rnorm(100, mean = 300, sd = 20), rnorm(100, mean = 400, sd = 20))
mean(x >= 320 & x <= 350)
# [1] 0.065
hist(x)
hist((x - mean(x)) / sd(x))
I suspect that what you are looking for is an estimate of the true, unobserved proportion. The standardization procedure would then be applicable if you had to use tabulated values of the standard normal distribution function. However, in R we can compute this directly, without any tables. In particular,
pnorm(350, mean = mean(x), sd = sd(x)) - pnorm(320, mean = mean(x), sd = sd(x))
# [1] 0.2091931
That's the probability P(320 <= X <= 350), where X is normally distributed with mean mean(x) and standard deviation sd(x). The figure is quite different from that above since we misspecified the underlying distribution by assuming it to be normal; it actually is a mixture of two normal distributions.
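As an illustration of that misspecification point (my addition, not part of the original answer), computing the same probability under the true 50/50 mixture of N(300, 20) and N(400, 20) gives a figure much closer to the empirical proportion than the single-normal approximation:
0.5 * (pnorm(350, 300, 20) - pnorm(320, 300, 20)) +
  0.5 * (pnorm(350, 400, 20) - pnorm(320, 400, 20))
# about 0.08, far closer to the empirical ~0.065 than the misspecified 0.209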

How to check if a value doesn't appear by chance with respect to a distribution in R

I have the following randomly generated distribution:
set.seed(1)
mean=100; sd=15
x <- seq(-4,4,length=100)*sd + mean
hx <- dnorm(x,mean,sd)
plot(x, hx, type="l", lty=2, xlab="x value",
ylab="Density", main="Some random distribution")
And a "non-random" value
set.seed(1)
x <- seq(-4,4,length=100)*10 + 100
ux <- dunif(x = x, min=10, max=100)
non_random_value <- ux[1]
non_random_value
# [1] 0.01111111
I'd like a statistic that shows non_random_value is significant and doesn't come up by chance with respect to hx.
How can I do that in R?
The function you want is pnorm(x, mean, sd). It returns the proportion of values in the normal distribution, defined by mean and sd, that are less than x.
You can use pnorm in a two-tailed test to get the proportion of values that are more extreme than x (i.e. farther from the mean in either direction).
p <- 2*pnorm(x, mean, sd, lower.tail=x<mean)
Interpret p as the proportion of potential values that are farther from the mean than x is. Keep in mind that any value of x can come up by chance from a normal distribution. You can't say that x is not random, merely that it seems unlikely to have come from the specified distribution.
Here's a nice site describing this in more detail.
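As a concrete usage example (my numbers, using the mean = 100 and sd = 15 from the question and an arbitrary test value of 130):
mean <- 100; sd <- 15
x <- 130
p <- 2 * pnorm(x, mean, sd, lower.tail = x < mean)
p
# about 0.0455: roughly 4.6% of draws from N(100, 15) are farther from the mean than 130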

Calculate probability of point on 2d density surface

If I calculate the 2d density surface of two vectors like in this example:
library(MASS)
a <- rnorm(1000)
b <- rnorm(1000, sd=2)
f1 <- kde2d(a, b, n = 100)
I get the following surface
filled.contour(f1)
The z-value is the estimated density.
My question now is: Is it possible to calculate the probability of a single point, e.g. a = 1, b = -4
[as I'm not a statistician this is maybe the wrong wording. Sorry for that. I would like to know - if this is possible at all - with which probability a point occurs.]
Thanks for every comment!
If you specify an area, then that area has a probability with respect to your density function. Of course a single point does not have a probability different from zero. But it does have a non-zero density at that point. What is that then?
The density at a point is the limit of the probability of a small area containing that point, divided by the area's measure, as that measure goes to zero. (It was actually rather hard to state that correctly; it took a few tries and it is still not optimal.)
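In symbols, restating the sentence above for the density f at a point (x, y) and a shrinking region A containing that point:
f(x, y) = lim_{|A| -> 0} P((X, Y) in A) / |A|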
All this is really basic calculus. It is also fairly easy to write a routine to calculate the integral of that density over the area, although I imagine MASS has standard ways to do it that use more sophisticated integration techniques. Here is a quick routine that I threw together based on your example:
library(MASS)
n <- 100
a <- rnorm(1000)
b <- rnorm(1000, sd=2)
f1 <- kde2d(a, b, n = 100)
lims <- c(min(a),max(a),min(b),max(b))
filled.contour(f1)
prob <- function(f, xmin, xmax, ymin, ymax, n, lims){
  # map the x/y limits of the region onto (approximate) grid indices
  ixmin <- max(1, n*(xmin - lims[1])/(lims[2] - lims[1]))
  ixmax <- min(n, n*(xmax - lims[1])/(lims[2] - lims[1]))
  iymin <- max(1, n*(ymin - lims[3])/(lims[4] - lims[3]))
  iymax <- min(n, n*(ymax - lims[3])/(lims[4] - lims[3]))
  # average density over the region times the region's area approximates the probability
  avg <- mean(f$z[ixmin:ixmax, iymin:iymax])
  probval <- (xmax - xmin)*(ymax - ymin)*avg
  return(probval)
}
prob(f1,0.5,1.5,-4.5,-3.5,n,lims)
# [1] 0.004788993
prob(f1,-1,1,-1,1,n,lims)
# [1] 0.2224353
prob(f1,-2,2,-2,2,n,lims)
# [1] 0.5916984
prob(f1,0,1,-1,1,n,lims)
# [1] 0.119455
prob(f1,1,2,-1,1,n,lims)
# [1] 0.05093696
prob(f1,-3,3,-3,3,n,lims)
# [1] 0.8080565
lims
# [1] -3.081773 4.767588 -5.496468 7.040882
Caveat, the routine seems right and is giving reasonable answers, but it has not undergone anywhere near the scrutiny I would give it for a production function.
The z-value here is a called a "probability density" rather than a "probability". As comments have pointed out, if you want an estimated probability you will need to integrate the estimated density to find the volume under your estimated surface.
However, if what you want is the probability density at a particular point, then you can use:
kde2d(a, b, n=1, lims=c(1, 1, -4, -4))$z[1,1]
# [1] 0.006056323
This will calculate a 1x1 "grid" with a single density estimate for the point you want.
A plot confirming that it worked:
z0 <- kde2d(a, b, n=1, lims=c(1, 1, -4, -4))$z[1,1]
filled.contour(
  f1,
  plot.axes = {
    contour(f1, levels=z0, add=TRUE)
    abline(v=1, lty=3)
    abline(h=-4, lty=3)
    axis(1); axis(2)
  }
)
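As a sanity check connecting the two answers (my sketch, not part of either original), the probability of a small square around (1, -4) divided by the square's area should roughly recover the point density; the match is only approximate because prob works on the coarse 100 x 100 grid.
eps <- 0.1
prob(f1, 1 - eps, 1 + eps, -4 - eps, -4 + eps, n, lims) / (2 * eps)^2
# should land in the same ballpark as z0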

Trying to compare two distributions

I found this code on the internet that compares a normal distribution to different Student t distributions:
x <- seq(-4, 4, length=100)
hx <- dnorm(x)
degf <- c(1, 3, 8, 30)
colors <- c("red", "blue", "darkgreen", "gold", "black")
labels <- c("df=1", "df=3", "df=8", "df=30", "normal")
plot(x, hx, type="l", lty=2, xlab="x value",
ylab="Density", main="Comparison of t Distributions")
for (i in 1:4){
lines(x, dt(x,degf[i]), lwd=2, col=colors[i])
}
I would like to adapt this to my situation where I would like to compare my data to a normal distribution. This is my data:
library(quantmod)
getSymbols("^NDX",src="yahoo", from='1997-6-01', to='2012-6-01')
daily<- allReturns(NDX) [,c('daily')]
dailySerieTemporel<-ts(data=daily)
ss<-na.omit(dailySerieTemporel)
The objective is to see whether my data is normal or not. Can someone help me out a bit with this? Thank you very much, I really appreciate it!
If you are only concerned with knowing whether your data is normally distributed or not, you can apply the Jarque-Bera test. Under the null hypothesis of this test, your data is normally distributed (see details here). You can perform this test using the jarque.bera.test function.
library(tseries)
jarque.bera.test(ss)
Jarque Bera Test
data: ss
X-squared = 4100.781, df = 2, p-value < 2.2e-16
Clearly, from the result, you can see that your data is not normally distributed, since the null has been rejected even at 1%.
To see why your data is not normally distributed, you can take a look at the descriptive statistics:
library(fBasics)
basicStats(ss)
ss
nobs 3776.000000
NAs 0.000000
Minimum -0.105195
Maximum 0.187713
1. Quartile -0.009417
3. Quartile 0.010220
Mean 0.000462
Median 0.001224
Sum 1.745798
SE Mean 0.000336
LCL Mean -0.000197
UCL Mean 0.001122
Variance 0.000427
Stdev 0.020671
Skewness 0.322820
Kurtosis 5.060026
From the last two rows, one can see that ss has excess kurtosis and non-zero skewness. This is the basis of the Jarque-Bera test.
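As a back-of-the-envelope check (my addition; it assumes the Kurtosis row in basicStats is excess kurtosis, i.e. kurtosis minus 3), the Jarque-Bera statistic can be reconstructed from the skewness and kurtosis shown in the table:
n <- 3776
S <- 0.322820       # skewness from the table
K <- 5.060026       # excess kurtosis from the table
n / 6 * (S^2 + K^2 / 4)
# about 4094, consistent with the X-squared of ~4100 reported by jarque.bera.test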
But if you are interested in comparing the actual distribution of your data against a normally distributed random variable with the same mean and variance as your data, you can first estimate the empirical density function from your data using a kernel and then plot it. Finally, you only have to generate a normal random variable with the same mean and variance as your data. Do something like this:
plot(density(ss, kernel='epanechnikov'))
set.seed(125)
lines(density(rnorm(length(ss), mean(ss), sd(ss)), kernel='epanechnikov'), col=2)
In this fashion you can overlay curves from other probability distributions.
The tests suggested by #Alex Reynolds will help you if your interest is to know what distribution your data were drawn from. If this is your goal, you can take a look at any goodness-of-fit test in any statistics textbook. Nevertheless, if you just want to know whether your variable is normally distributed, then the Jarque-Bera test is good enough.
Take a look at Q-Q, Shapiro-Wilk or K-S tests to see if your data are normally distributed.
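For completeness, a short sketch of those checks (my illustration, assuming the ss series from above is loaded; note that estimating the mean and sd from the same data makes the K-S p-value only approximate):
qqnorm(ss); qqline(ss, col = 2)                       # heavy tails bend away from the reference line
shapiro.test(as.numeric(ss))                          # Shapiro-Wilk; fine here since n < 5000
ks.test(as.numeric(ss), "pnorm", mean(ss), sd(ss))    # K-S against a normal with the fitted mean and sd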
