Sum of N independent standard normal variables - r

I want to simulate the sum of N independent standard normal variables.
sums <- c(1:5000)
for (i in 1:5000) {
sums[i] <- sum(rnorm(5000,0,1))
}
I draw N = 5000 standard normal values and sum them, then repeat this for 5000 simulation paths.
I would expect the mean of sums to be 0 and the variance of sums to be 5000.
> mean(sums)
[1] 0.4260789
> var(sums)
[1] 5032.494
The simulated mean seems too far from 0. When I ran it again, I got 1.309206 for the mean.

@ilir is correct: the value you get is essentially zero.
If you look at the plot, you get values between -200 and 200. 0.42 is for all intents and purposes 0.
You can test this with t.test.
> t.test(sums, mu = 0)
One Sample t-test
data: sums
t = -1.1869, df = 4999, p-value = 0.2353
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
-3.167856 0.778563
sample estimates:
mean of x
-1.194646
With a p-value of 0.2353, there is no evidence that the mean of your sums differs from zero.

It is perfectly normal that the mean does not fall exactly on 0, because it is an empirical mean computed from "only" 5000 realizations of the random variable.
However, the distribution of the realizations contained in the sums vector should "look" Gaussian.
For example, when I plot the histogram and the Q-Q plot of 10000 realizations of the sum of 5000 Gaussian variables (created with sums <- replicate(1e4,sum(rnorm(5000,0,1)))), it looks normal, as you can see in the following figures:
hist(sums)
qqnorm(sums)

The sum of independent normals is again normal, with mean equal to the sum of the means and variance equal to the sum of the variances. So sum(rnorm(5000,0,1)) has the same distribution as rnorm(1,0,sqrt(5000)). The sample average of normals is again normal. In your case you take the sample average of 5000 independent normal variables with zero mean and variance 5000. This is a normal variable with zero mean and variance 5000/5000 = 1, i.e. a standard normal.
So in your case mean(sums) is distributed like rnorm(1): it falls in the interval (-1.96, 1.96) about 95% of the time, which makes values such as 0.43 or 1.31 unremarkable.
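One way to check this claim is to repeat the whole experiment many times and look at how mean(sums) itself varies. The sketch below is only an illustration: it uses n = 100 terms and m = 100 paths instead of 5000 and 5000 (arbitrary choices to keep it fast), and the names n, m, one_mean and means are not from the question.

```r
set.seed(1)   # arbitrary seed, only for reproducibility

n <- 100      # terms per sum (the question uses 5000)
m <- 100      # simulated sums per experiment (the question uses 5000)

# One experiment: simulate m sums of n standard normals, return their average.
one_mean <- function() mean(replicate(m, sum(rnorm(n))))

# Repeat the experiment to see the sampling distribution of mean(sums).
means <- replicate(1000, one_mean())

mean(means)                 # should be close to 0
sd(means)                   # should be close to sqrt(n / m) = 1
mean(abs(means) < 1.96)     # should be close to 0.95
```

Because n equals m here, as in the question, the standard deviation of mean(sums) is sqrt(n / m) = 1, so single results of order 1 are expected.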

Related

Why does the difference in calculation confidence interval occur?

I am trying to calculate the confidence interval (CI) for a proportion.
The data is like this:
End Count
death 57
pat 319
where pat is the total sample size.
I used the following formula:
#lower CI
57/319 - 1.96*sqrt(57/319*(1-57/319)/319)
#upper CI
57/319 + 1.96*sqrt(57/319*(1-57/319)/319)
The formulas above gave [0.1366, 0.2207].
However, when I used prop.test(),
prop.test(57, 319, correct = FALSE)
The result was [0.1405442, 0.2244692].
Could you please explain why this happens?
Thank you in advance.
Your confidence interval is the Wald interval, an approximation that assumes the sample proportion is normally distributed. For count data, and especially for proportions, this assumption can be very inaccurate, particularly if the proportion is not close to 0.5 or the sample size is small. The prop.test() function instead computes a Wilson score interval, which is not symmetric about the estimate (notice that 57/319 = 0.1787 does not lie in the middle of the interval). The method used is documented in the articles cited on the manual page (?prop.test). The manual page also notes that the estimate used in binom.test, an exact (Clopper-Pearson) interval, can be more accurate:
binom.test(57, 319)
Exact binomial test
data: 57 and 319
number of successes = 57, number of trials = 319, p-value < 2.2e-16
alternative hypothesis: true probability of success is not equal to 0.5
95 percent confidence interval:
0.1382331 0.2252183
sample estimates:
probability of success
0.1786834
The differences among the three intervals are small, but binom.test() should be your choice if the difference affects whether or not you reject the null hypothesis.
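To make the comparison concrete, the three intervals can be reproduced side by side. This is a sketch using the standard Wald and Wilson score formulas; the variable names (wald, wilson, score, exact) are illustrative, and z is taken as qnorm(0.975) rather than the rounded 1.96 used in the question.

```r
x <- 57; n <- 319
p <- x / n
z <- qnorm(0.975)   # about 1.96

# Wald interval: symmetric about p (the formula used in the question).
wald <- p + c(-1, 1) * z * sqrt(p * (1 - p) / n)

# Wilson score interval: centre shifted towards 0.5, hence asymmetric about p.
centre <- (p + z^2 / (2 * n)) / (1 + z^2 / n)
half   <- z * sqrt(p * (1 - p) / n + z^2 / (4 * n^2)) / (1 + z^2 / n)
wilson <- centre + c(-1, 1) * half

# Intervals reported by the built-in tests.
score <- as.numeric(prop.test(x, n, correct = FALSE)$conf.int)
exact <- as.numeric(binom.test(x, n)$conf.int)

rbind(wald, wilson, score, exact)
```

The hand-computed wilson row should agree with the prop.test() row, which is where the difference from the question's formula comes from.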

Calculating significance of two variables in a dataset for every column

I would like to find out how to create a column of p-values to check the significance of the variable for every observation. I would like to check the p-values for the two columns on the right side of the data set. I think the most efficient way to do so is to calculate a t.test for every column, but I don't know how to do so.
This is what I tried, but it didn't give me the significance of every table.
t.test(Elasapp,Elashuis,var.equal=TRUE)
Results:
Two Sample t-test
data: Elasapp and Elashuis
t = 41.674, df = 48860, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
0.07461778 0.08198304
sample estimates:
mean of x mean of y
0.085672044 0.007371636

Majority observations outside confidence interval

I have a time series (called sigma.year) with a length of 100. Its histogram and Q-Q plot show strong evidence of normality.
I am calculating a confidence interval (in R) for the sample mean as follows:
exp.sigma<-mean(sigma.year)
sd.sigma<-sd(sigma.year)
se.sigma=sd.sigma/sqrt(length(sigma.year))
me.sigma=qt(.995,df=length(sigma.year)-1)*se.sigma
low.sigma=exp.sigma-me.sigma
up.sigma=exp.sigma+me.sigma
My problem is that 83 of the 100 observations fall outside the confidence interval. Do you have any idea why this is so? Is it because I have time-series rather than cross-section data? Or am I calculating the confidence interval in the wrong way?
Thanks.
It's hard to evaluate completely without knowing all of your inputs (for example, a dput of sigma.year), but your confidence interval appears to be a confidence interval for the mean. Such an interval describes the uncertainty in the estimated mean, not the spread of the individual observations, and its width shrinks as 1/sqrt(n). So it is not unexpected that 83/100 observations are outside of a 99% confidence interval about the mean.
To clarify: if sd.sigma is the standard deviation of your sample, then you have correctly calculated the 99% confidence interval about the mean.
And again, your data are behaving as you'd expect for a sample of 100 observations drawn from a population with a normal distribution. Here's some code to check that:
x <- rnorm(100)
exp.x <- mean(x)
se.x <- sd(x)/sqrt(length(x))
q.x <- qt(0.995, df = length(x)-1)
interval <- c(exp.x - se.x*q.x, exp.x + se.x*q.x)
sum(x > interval[1] & x < interval[2])
# the count of observations inside the interval will vary, because I
# deliberately didn't set the seed, but you'll get a value around 20
# (i.e. about 80 of the 100 observations fall outside)
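If the goal is an interval that most individual observations fall into, a prediction interval is the more relevant quantity. The following is a minimal sketch under the same assumption as the check above (independent normal data); the seed and the names ci, pred, inside_ci and inside_pred are illustrative and not from the question.

```r
set.seed(42)                      # arbitrary seed, only for reproducibility
x <- rnorm(100)
n <- length(x)
q <- qt(0.995, df = n - 1)

# 99% confidence interval for the mean: width shrinks like 1 / sqrt(n).
ci <- mean(x) + c(-1, 1) * q * sd(x) / sqrt(n)

# 99% prediction interval for a single new observation: width stays near q * sd.
pred <- mean(x) + c(-1, 1) * q * sd(x) * sqrt(1 + 1 / n)

inside_ci   <- sum(x > ci[1] & x < ci[2])
inside_pred <- sum(x > pred[1] & x < pred[2])
c(inside_ci = inside_ci, inside_pred = inside_pred)
```

Only a minority of the observations should land inside ci, while nearly all of them should land inside pred.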

Gamma equivalent to standard deviations

I have a gamma distribution fit to my data using library(fitdistrplus). I need a method for defining the range of x values that can be "reasonably" expected, analogous to using standard deviations with normal distributions.
For example, x values within two standard deviations of the mean could be considered the reasonable range of expected values from a normal distribution. Any suggestions for how to define a similar range of expected values based on the shape and rate parameters of a gamma distribution?
...maybe something like identifying the two values of x between which 95% of the data lie?
Let's assume we have a random variable that is gamma distributed with shape alpha=2 and rate beta=3. We would expect this distribution to have mean 2/3 and standard deviation sqrt(2)/3, and indeed we see this in simulated data:
mean(rgamma(100000, 2, 3))
# [1] 0.6667945
sd(rgamma(100000, 2, 3))
# [1] 0.4710581
sqrt(2) / 3
# [1] 0.4714045
It would be pretty weird to define the expected range as [mean - c*sd, mean + c*sd] for some multiplier c. To see why, consider c = 2 in the example above. This would yield the range [-0.276, 1.609], but the gamma distribution can't even take on negative values, and 4.7% of the data falls above 1.609. This is at the very least not a well-balanced interval.
A more natural choice might be to take the 0.025 and 0.975 quantiles of the distribution as the range. We would expect 2.5% of the data to fall below this range and 2.5% to fall above it. We can use qgamma to determine that for our example parameters the range would be [0.081, 1.857].
qgamma(c(0.025, 0.975), 2, 3)
# [1] 0.08073643 1.85721446
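The tail probabilities quoted above can be checked exactly with pgamma rather than by simulation. This sketch reuses the example's shape = 2 and rate = 3; the names sym, below_sym, above_sym, qr and coverage are illustrative.

```r
shape <- 2; rate <- 3
m <- shape / rate          # mean
s <- sqrt(shape) / rate    # standard deviation

# Mean +/- 2 sd: the lower limit is negative, so nothing can fall below it.
sym <- m + c(-2, 2) * s
below_sym <- pgamma(sym[1], shape, rate)                      # 0: no mass below a negative value
above_sym <- pgamma(sym[2], shape, rate, lower.tail = FALSE)  # about 0.047

# Quantile-based range: 2.5% in each tail by construction.
qr <- qgamma(c(0.025, 0.975), shape, rate)
coverage <- diff(pgamma(qr, shape, rate))                     # 0.95

c(below_sym = below_sym, above_sym = above_sym, coverage = coverage)
```

The symmetric range puts all of its excluded mass in the upper tail, whereas the quantile range splits it evenly.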
In terms of the shape k and scale theta (where theta = 1/rate), the expected value of a gamma distribution is:
E[X] = k * theta
and the variance is Var[X] = k * theta^2.
But typically I would use the central 95% quantile range to indicate the spread of the data.
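Since the question mentions shape and rate while these formulas use shape and scale, a short conversion may help. This is a sketch with the illustrative values shape = 2 and rate = 3 (not estimates from the asker's data); by_rate and by_scale are hypothetical names.

```r
shape <- 2; rate <- 3
scale <- 1 / rate                  # theta = 1 / rate

c(mean = shape * scale,            # k * theta
  sd   = sqrt(shape * scale^2))    # sqrt(k * theta^2)

# Same 95% range, whichever parameterisation is passed to qgamma.
by_rate  <- qgamma(c(0.025, 0.975), shape = shape, rate = rate)
by_scale <- qgamma(c(0.025, 0.975), shape = shape, scale = scale)
rbind(by_rate, by_scale)
```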

Difference between Hmisc wtd.var and SAS proc Mean generated weighted variance

I'm getting different results from R and SAS when I try to calculate a weighted variance. Does anyone know what might be causing this difference?
I create vectors of weights and values, and then calculate the weighted variance using the wtd.var function from the Hmisc library:
library(Hmisc)
wt <- c(5, 5, 4, 1)
x <- c(3.7,3.3,3.5,2.8)
wtd.var(x,weights=wt)
I get an answer of:
[1] 0.0612381
But if I try to reproduce these results in SAS I get a quite different result:
data test;
input wt x;
cards;
5 3.7
5 3.3
4 3.5
1 2.8
;
run;
proc means data=test var;
var x;
weight wt;
run;
Results in an answer of
0.2857778
You probably have a difference in how the variance is calculated. SAS gives you an option, VARDEF, which may help here.
proc means data=test var vardef=WDF;
var x;
weight wt;
run;
On your dataset that gives the same variance as R (0.0612381). Both are 'right', depending on how you choose to calculate the weighted variance. (At my shop we calculate it a third way, of course...)
Complete text from PROC MEANS documentation:
VARDEF=divisor specifies the divisor to use in the calculation of the
variance and standard deviation. The following table shows the
possible values for divisor and associated divisors.
Possible Values for VARDEF=
Value         Divisor                    Formula for Divisor
DF            degrees of freedom         n - 1
N             number of observations     n
WDF           sum of weights minus one   (Σ_i w_i) - 1
WEIGHT | WGT  sum of weights             Σ_i w_i
The procedure computes the variance as CSS/Divisor, where CSS
is the corrected sums of squares and equals Sum((Xi-Xbar)^2). When you
weight the analysis variables, CSS equals sum(Wi*(Xi-Xwbar)^2), where
Xwbar is the weighted mean.
Default: DF
Requirement: To compute the standard error of the mean, confidence limits for the mean, or the Student's t-test, use the default value of VARDEF=.
Tip: When you use the WEIGHT statement and VARDEF=DF, the variance is an estimate of Sigma^2, where the variance of the ith observation is Sigma^2/wi and wi is the weight for the ith observation. This method yields an estimate of the variance of an observation with unit weight.
Tip: When you use the WEIGHT statement and VARDEF=WGT, the computed variance is asymptotically (for large n) an estimate of Sigma^2/wbar, where wbar is the average weight. This method yields an asymptotic estimate of the variance of an observation with average weight.
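The CSS/Divisor rule quoted above can be reproduced in base R for the question's data, which shows where both numbers come from. This is a sketch; xw, css, divisor and v are illustrative names, and it does not call Hmisc or SAS.

```r
wt <- c(5, 5, 4, 1)
x  <- c(3.7, 3.3, 3.5, 2.8)

xw  <- sum(wt * x) / sum(wt)    # weighted mean
css <- sum(wt * (x - xw)^2)     # weighted corrected sum of squares

# One divisor per VARDEF= value from the table.
divisor <- c(DF = length(x) - 1, N = length(x),
             WDF = sum(wt) - 1, WGT = sum(wt))
v <- css / divisor
v
```

The DF entry matches the SAS default result (0.2857778) and the WDF entry matches the wtd.var result (0.0612381).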

Resources