trying to compare two distributions - r

I found this code on the internet that compares a normal distribution to several Student's t distributions:
x <- seq(-4, 4, length=100)
hx <- dnorm(x)
degf <- c(1, 3, 8, 30)
colors <- c("red", "blue", "darkgreen", "gold", "black")
labels <- c("df=1", "df=3", "df=8", "df=30", "normal")
plot(x, hx, type="l", lty=2, xlab="x value",
     ylab="Density", main="Comparison of t Distributions")
for (i in 1:4){
  lines(x, dt(x, degf[i]), lwd=2, col=colors[i])
}
I would like to adapt this to my situation, where I want to compare my data to a normal distribution. This is my data:
library(quantmod)
getSymbols("^NDX", src="yahoo", from='1997-6-01', to='2012-6-01')
daily <- allReturns(NDX)[, c('daily')]
dailySerieTemporel <- ts(data=daily)
ss <- na.omit(dailySerieTemporel)
The objective is to see whether my data are normal or not... Can someone help me out a bit with this? Thank you very much, I really appreciate it!

If you are only concerned with knowing whether your data are normally distributed or not, you can apply the Jarque-Bera test. Under the null hypothesis of this test your data are normally distributed (see the details here). You can perform this test with the jarque.bera.test function.
library(tseries)
jarque.bera.test(ss)
Jarque Bera Test
data: ss
X-squared = 4100.781, df = 2, p-value < 2.2e-16
Clearly, from the result, you can see that your data are not normally distributed, since the null has been rejected even at the 1% level.
To see why your data are not normally distributed, you can take a look at the descriptive statistics:
library(fBasics)
basicStats(ss)
ss
nobs 3776.000000
NAs 0.000000
Minimum -0.105195
Maximum 0.187713
1. Quartile -0.009417
3. Quartile 0.010220
Mean 0.000462
Median 0.001224
Sum 1.745798
SE Mean 0.000336
LCL Mean -0.000197
UCL Mean 0.001122
Variance 0.000427
Stdev 0.020671
Skewness 0.322820
Kurtosis 5.060026
From the last two rows, one can see that ss has excess kurtosis and its skewness is not zero. This is exactly what the Jarque-Bera test is based on.
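For illustration, here is a minimal sketch (not part of the original answer) of how the Jarque-Bera statistic is built from exactly those two quantities, assuming the ss series from the question; the result should be close to the X-squared value reported above:
# Hand-rolled Jarque-Bera statistic from sample skewness and kurtosis (sketch)
n <- length(ss)
m2 <- mean((ss - mean(ss))^2)                  # second central moment
skew <- mean((ss - mean(ss))^3) / m2^(3/2)     # sample skewness
kurt <- mean((ss - mean(ss))^4) / m2^2         # sample kurtosis (normal = 3)
jb <- n/6 * (skew^2 + (kurt - 3)^2 / 4)        # JB statistic, chi-squared with 2 df under H0
pchisq(jb, df = 2, lower.tail = FALSE)         # p-value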
But if you are interested in comparing the actual distribution of your data against a normally distributed random variable with the same mean and variance as your data, you can first estimate the empirical density function from your data using a kernel and plot it, and then overlay the density of a normal random sample with the same mean and variance as your data. Do something like this:
plot(density(ss, kernel='epanechnikov'))
set.seed(125)
lines(density(rnorm(length(ss), mean(ss), sd(ss)), kernel='epanechnikov'), col=2)
In this fashion you can overlay curves generated from other probability distributions.
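As a small variant (my sketch, not the answer's code), you can also overlay the theoretical normal density directly with curve() instead of simulating a normal sample:
plot(density(ss, kernel = 'epanechnikov'))
curve(dnorm(x, mean = mean(ss), sd = sd(ss)), add = TRUE, col = 2)  # theoretical normal density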
The tests suggested by @Alex Reynolds will help you if your interest is to know which distribution your data might have been drawn from. If this is your goal, you can take a look at any goodness-of-fit test in a statistics textbook. Nevertheless, if you just want to know whether your variable is normally distributed, then the Jarque-Bera test is good enough.

Take a look at Q-Q plots, the Shapiro-Wilk test, or the Kolmogorov-Smirnov (K-S) test to see if your data are normally distributed.
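For concreteness, a minimal sketch of those checks applied to the ss series defined in the question (note shapiro.test accepts at most 5000 observations, and the K-S p-value is only approximate here because the parameters are estimated from the same data):
qqnorm(ss); qqline(ss)                    # Q-Q plot against normal quantiles
shapiro.test(ss)                          # Shapiro-Wilk normality test
ks.test(ss, "pnorm", mean(ss), sd(ss))    # Kolmogorov-Smirnov against N(mean(ss), sd(ss))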

Related

Why is the bootstrapping technique not accurate for my non-normal distribution?

My code:
set.seed(1234)
population1 <- rpois(1000000, 0.6)
sample1 <- sample(population1, 30, replace=F)
mean(sample1)
sample_bs <- replicate(200, mean(sample(sample1, 30, replace=T)))
mean(sample_bs)
gmodels::ci(sample_bs)
My results:
population mean: 0.6
sample mean: 0.4666
Results from my bootstrapping procedure:
Estimate CI lower CI upper Std. Error
0.467500000 0.450542365 0.484457635 0.008599396
So my question is: why are the results from this procedure still so far from the original population mean?
You are not sampling from your original population; you are sampling from your first sample! So the mean will be close to the mean of sample1, not population1.
Try this instead, and you will see how both results are close:
set.seed(1234)
population1 <- rpois(1000000, 0.6)
sample1 <- sample(population1, 30, replace=F)
mean(sample1)
sample_bs <- replicate(200, mean(sample(population1, 30, replace=T)))
mean(sample_bs)
gmodels::ci(sample_bs)
Your main issue is that the sample size is too small. We can make some educated guesses at the precision you can expect from a sample size of 30, and what sample size you'd need for a given expected precision.
We will assume a normal distribution, but for this to be reasonable, the distribution of what you want to measure needs to be reasonably normal. That is, the distribution of the means needs to be reasonably normal; the distribution of a given sample does not.
So…
set.seed(1234)
population1 <- rpois(1000000, 0.6)
sm30 <- scale(replicate(10000, mean(sample(population1, 30, rep=FALSE))))
s <- seq(-3, 3, by=0.01)
plot(s, dnorm(s, 0, 1), type="l", col="grey80", lwd=9)
lines(density(sm30), col=3, lwd=2)
That looks fairly normal(ish) to me. Some skew, but close enough for government work, as they say (CERN might demand better).
Assuming this is normal enough, we can continue estimating (this is Stat101 stuff):
z <- qnorm(1-(0.05/2))
0.6 + z*c(-1, 1)*sqrt(0.6/30)
# 0.32281924 0.87718076
With a 95% confidence level (5% significance) and a sample size of 30, we get some pretty wide confidence intervals (CI), as you experienced.
By rearranging the expression a bit, we can figure out what sample size we need for a given precision at a given confidence level. If we keep the 5% significance level and say we want the CI to be 0.6 ± 0.1, we get:
(n <- ceiling((z*sqrt(0.6)/0.1)^2))
# 231
0.6 + z*c(-1, 1)*sqrt(0.6/n)
# 0.50011099 0.69988901
That's a bit more than 30.
While bootstrapping can help you estimate confidence intervals for non-normal distributions, it can't make up for missing information. The amount of information is inherent in the variance of your data and the size of your sample.
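A hedged illustration of that last point (my sketch, not part of the answer): the width of a bootstrap CI is driven by the size of the original sample, not by the number of bootstrap replicates.
set.seed(1234)
population1 <- rpois(1e6, 0.6)
ci_width <- function(n, B = 2000) {
  s  <- sample(population1, n, replace = FALSE)           # one original sample of size n
  bs <- replicate(B, mean(sample(s, n, replace = TRUE)))  # bootstrap means
  unname(diff(quantile(bs, c(0.025, 0.975))))             # width of the percentile CI
}
c(n30 = ci_width(30), n231 = ci_width(231))               # the larger sample gives a much narrower CI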

Fit distribution to given frequency values in R

I have frequency values changing with time (x-axis units), as presented in the picture below. After some normalization these values may be seen as data points of a density function for some distribution.
Q: Assuming that these frequency points come from a Weibull distribution T, how can I best fit a Weibull density function to the points so as to infer the parameters of T from it?
sample <- c(7787,3056,2359,1759,1819,1189,1077,1080,985,622,648,518,
611,1037,727,489,432,371,1125,69,595,624)
plot(1:length(sample), sample, type = "l")
points(1:length(sample), sample)
Update.
To prevent being misunderstood, I would like to add a little more explanation. By saying I have frequency values changing with time (x-axis units), I mean I have data which say that I have:
7787 realizations of value 1
3056 realizations of value 2
2359 realizations of value 3 ... etc.
One way towards my goal (an incorrect one, I think) would be to create a set of these realizations:
# Loop to expand the frequencies into individual realizations
set.values <- c()
for (i in 1:length(sample)) {
  set.values <- c(set.values, rep(i, times = sample[i]))
}
hist(set.values)
lines(1:length(sample), sample)
points(1:length(sample), sample)
and use fitdistr on the set.values:
library(MASS)   # fitdistr() comes from MASS
f2 <- fitdistr(set.values, 'weibull')
f2
Why do I think this is an incorrect approach, and why am I looking for a better solution in R?
in the distribution-fitting approach presented above it is assumed that set.values is a complete set of my realizations from the distribution T
in my original question I know only the points from the first part of the density curve - I do not know its tail, and I want to estimate the tail (and the whole density function)
Here is a better attempt; like before, it uses optim to find the best value constrained to a box (defined by the lower and upper vectors in the optim call). Notice it scales x and y as part of the optimization, in addition to the Weibull shape parameter, so we have 3 parameters to optimize over.
Unfortunately, when using all the points it pretty much always finds something on the edges of the constraining box, which indicates to me that Weibull is maybe not a good fit for all of the data. The problem is the first two points - they are just too large. You see the attempted fit to all data in the first plot.
If I drop those first two points and just fit the rest, we get a much better fit. You see this in the second plot. I think this is a good fit; it is in any case a local minimum in the interior of the constraining box.
library(optimx)   # loaded in the original answer, though base optim() below is what is actually used
sample <- c(60953,7787,3056,2359,1759,1819,1189,1077,1080,985,622,648,518,
            611,1037,727,489,432,371,1125,69,595,624)
t.sample <- 0:22
# drop the first two (very large) points from the fit
s.fit <- sample[3:23]
t.fit <- t.sample[3:23]
# scaled Weibull density: param[1] = shape, param[2] = y scale, param[3] = x scale
wx <- function(param) {
  res <- param[2] * dweibull(t.fit * param[3], shape = param[1])
  return(res)
}
# objective: root of the sum of squared residuals
minwx <- function(param) {
  v <- s.fit - wx(param)
  sqrt(sum(v * v))
}
p0 <- c(1, 200, 1/20)
paramopt <- optim(p0, minwx, gr = NULL, method = "L-BFGS-B",
                  lower = c(0.1, 100, 0.01), upper = c(1.1, 5000, 1))
popt <- paramopt$par
popt
rms <- paramopt$value
tit <- sprintf("Weibull - Shape:%.3f yscale:%.1f xscale:%.5f rms:%.1f", popt[1], popt[2], popt[3], rms)
plot(t.sample[2:23], sample[2:23], type = "p", col = "darkred")
lines(t.fit, wx(popt), col = "blue")
title(main = tit)
You can directly calculate the maximum likelihood parameters, as described here.
# Error of the implicit ML equation for the shape parameter k,
# treating vec[i] as the frequency (weight) of the value i
k.diff <- function(k, vec){
  x2 <- seq(length(vec))
  abs(k^-1 + weighted.mean(log(x2), w = vec) -
        weighted.mean(log(x2), w = x2^k * vec))
}
# Find the k that makes the error (almost) zero, fulfilling the equation
k <- optimize(k.diff, vec = sample, interval = c(0.1, 5), tol = 10^-7)$min
# Calculate lambda (the scale), given k
l <- weighted.mean(seq(length(sample))^k, w = sample)^(1/k)
# Plot
plot(density(rep(seq(length(sample)), sample)))
x <- 1:25
lines(x, dweibull(x, shape = k, scale = l))
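For reference, and assuming (as the code does) that sample[i] is the frequency weight w_i attached to the value x_i = i, the weighted Weibull maximum-likelihood equations being solved are
1/k = sum(w * x^k * log(x)) / sum(w * x^k) - sum(w * log(x)) / sum(w)
lambda = ( sum(w * x^k) / sum(w) )^(1/k)
which is why the optimize call drives the first expression to zero and the lambda line raises the weighted mean of x^k to the power 1/k.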
Assuming the data are from a Weibull distribution, you can get an estimate of the shape and scale parameter like this:
library(MASS)   # for fitdistr()
sample <- c(7787,3056,2359,1759,1819,1189,1077,1080,985,622,648,518,
            611,1037,727,489,432,371,1125,69,595,624)
f <- fitdistr(sample, 'weibull')
f
If you are not sure whether it is Weibull distributed, I would recommend using ks.test. This tests whether your data come from a hypothesised distribution. Given your knowledge of the nature of the data, you could test a few selected distributions and see which one works best.
For your example this would look like this:
ks = ks.test(sample, "pweibull", shape=f$estimate[1], scale=f$estimate[2])
ks
The p-value is not significant, hence you do not reject the hypothesis that the data come from a Weibull distribution.
Update: The histograms of either the Weibull or exponential look like a good match to your data. I think the exponential distribution gives you a better fit. Pareto distribution is another option.
f<-fitdistr(sample, 'weibull')
z<-rweibull(10000, shape= f$estimate[1],scale= f$estimate[2])
hist(z)
f<-fitdistr(sample, 'exponential')
z = rexp(10000, f$estimate[1])
hist(z)
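If you want a quick numeric comparison rather than eyeballing histograms, a hedged sketch is to compare the candidates by AIC, using the logLik method that fitdistr objects provide (lower AIC is better):
library(MASS)
fits <- list(weibull = fitdistr(sample, 'weibull'),
             exponential = fitdistr(sample, 'exponential'))
sapply(fits, AIC)   # lower AIC suggests the better-fitting candidate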

How to fit a normal cumulative distribution function to data

I have generated some data which is effectively a cumulative distribution; the code below gives an example of X and Y from my data:
X<- c(0.09787761, 0.10745590, 0.11815422, 0.15503521, 0.16887488, 0.18361325, 0.22166727,
0.23526786, 0.24198808, 0.25432602, 0.26387961, 0.27364063, 0.34864672, 0.37734113,
0.39230736, 0.40699061, 0.41063824, 0.42497043, 0.44176913, 0.46076456, 0.47229330,
0.53134509, 0.56903577, 0.58308938, 0.58417653, 0.60061901, 0.60483849, 0.61847521,
0.62735245, 0.64337353, 0.65783302, 0.67232004, 0.68884473, 0.78846000, 0.82793293,
0.82963446, 0.84392010, 0.87090024, 0.88384044, 0.89543314, 0.93899033, 0.94781219,
1.12390279, 1.18756693, 1.25057774)
Y<- c(0.0090, 0.0210, 0.0300, 0.0420, 0.0580, 0.0700, 0.0925, 0.1015, 0.1315, 0.1435,
0.1660, 0.1750, 0.2050, 0.2450, 0.2630, 0.2930, 0.3110, 0.3350, 0.3590, 0.3770, 0.3950,
0.4175, 0.4475, 0.4715, 0.4955, 0.5180, 0.5405, 0.5725, 0.6045, 0.6345, 0.6585, 0.6825,
0.7050, 0.7230, 0.7470, 0.7650, 0.7950, 0.8130, 0.8370, 0.8770, 0.8950, 0.9250, 0.9475,
0.9775, 1.0000)
plot(X,Y)
I would like to obtain the median, mean and some quantile information (say for example 5%, 95%) from this data. The way I was thinking of doing this was to fit a defined distribution to it and then integrate to get my quantiles, mean and median values.
The question is how to fit the most appropriate cumulative distribution function to this data (I expect this may well be the Normal Cumulative Distribution Function).
I have seen lots of ways to fit a PDF but I can't find anything on fitting a CDF.
(I realise this may seem a basic question to many of you but it has me struggling!!)
Thanks in advance
Perhaps you could use nlm to find parameters that minimize the squared differences between your observed Y values and the values expected for a normal distribution. Here is an example using your data:
fn <- function(x) {
  mu <- x[1]
  sigma <- exp(x[2])   # exp() keeps sigma positive during the optimisation
  sum((Y - pnorm(X, mu, sigma))^2)
}
est <- nlm(fn, c(1, 1))$estimate
plot(X, Y)
curve(pnorm(x, est[1], exp(est[2])), add = TRUE)
Unfortunately I don't know an easy way with this method to constrain sigma > 0 without doing the exp transformation on the variable. But the fit seems reasonable.
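Once mu and sigma are estimated, the mean, median, and quantiles the question asks for come straight from the fitted normal; a small follow-up sketch (my addition, using the est object from the code above):
mu <- est[1]; sigma <- exp(est[2])       # undo the exp() reparameterisation
c(mean = mu,
  median = qnorm(0.5, mu, sigma),        # equals mu for a normal distribution
  q05 = qnorm(0.05, mu, sigma),
  q95 = qnorm(0.95, mu, sigma))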

Finding distribution of sample mean by central limit theorem

Let X1, ..., X25 be a random sample from a normal distribution with mean = 37 and sd = 45.
Let xbar be the sample mean. How is xbar distributed? I have to verify it by the central limit theorem.
Also compute P(xbar > 43.1).
My attempt:
for(i in 1:1000){
x=rnorm(25,mean=37,sd=45)
xbar=mean(x)
z=(xbar-37)/(45/sqrt(25))
}
z
But I couldn't find the distribution of xbar.
Change your for loop and use replicate instead
set.seed(1)
X <- replicate(1000, rnorm(25,mean=37,sd=45))
X_bar <- colMeans(X)
hist(X_bar) # this is how the distribution of X_bar looks like
xbar <- c()
for (i in 1:1000) {
  x <- rnorm(25, mean = 37, sd = 45)
  xbar <- c(xbar, mean(x))   # save the value of xbar on every iteration
}
hist(xbar)   # plot the histogram of xbar
# compute the probability of being bigger than 43.1
prob <- sum(xbar > 43.1) / length(xbar)
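As a hedged sanity check (my addition), the exact CLT value can be computed directly: under the CLT, xbar ~ N(37, 45/sqrt(25)), so the simulated proportion above should be close to
pnorm(43.1, mean = 37, sd = 45/sqrt(25), lower.tail = FALSE)   # about 0.25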
Just to expand on this a little bit.
The Central Limit Theorem states that the distribution of the mean is asymptotically N[mu, sd/sqrt(n)], where mu and sd are the mean and standard deviation of the underlying distribution, and n is the sample size used in calculating the mean. So, in the example below, data is a dataset of size 2500 drawn from N[37,45], arbitrarily segmented into 100 groups of 25. means is a dataset of the means of each group. Note that both the data and the means are (approximately) normally distributed, but the distribution of the means is much tighter (lower sigma). From the CLT we expect sd(means) ~ sd(data)/sqrt(25), which it is.
data <- data.frame(sample=rep(1:100,each=25),x = rnorm(2500,mean=37,sd=45))
means <- aggregate(data$x,by=list(data$sample),mean)
# plot histograms
par(mfrow=c(1,2))
hist(data$x,main="",sub="Histogram of Underlying Data",xlim=c(-150,200))
hist(means$x,main="",sub="Histogram of Means", xlim=c(-150,200))
mtext("Underlying Data ~ N[37,45]",outer=T,line=-3)
c(sd.data=sd(data$x), sd.means=sd(means$x))
sd.data sd.means
43.548570 7.184518
But the real power of the CLT is that it shows that the distribution of the means is asymptotically normal, regardless of the distribution of the underlying data. This is shown here, where the underlying data is sampled from a uniform distribution. Again, sd(mean) ~ sd(data)/sqrt(25).
data <- data.frame(sample=rep(1:100,each=25),x = runif(2500,min=-150, max=200))
means <- aggregate(data$x,by=list(data$sample),mean)
# plot histograms
par(mfrow=c(1,2))
hist(data$x,main="",sub="Histogram of Underlying Data",xlim=c(-150,200))
hist(means$x,main="",sub="Histogram of Means", xlim=c(-150,200))
mtext("Underlying Data ~ U[-150,200]",outer=T,line=-3)
c(sd.data=sd(data$x), sd.means=sd(means$x))
sd.data sd.means
99.7800 18.8176
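As a hedged cross-check (my addition), the theoretical values agree: a U[-150, 200] variable has sd (200 - (-150))/sqrt(12), and the mean of 25 draws has that divided by sqrt(25):
sd_u <- (200 - (-150)) / sqrt(12)                              # about 101.0
c(sd.data.theory = sd_u, sd.means.theory = sd_u / sqrt(25))    # about 101.0 and 20.2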

Programming a QQ plot

I have a sample of math test scores for male and female students. I want to draw a QQ plot for each gender to see if each is normally distributed. I know how to draw the QQ plot for the overall sample, but how can I draw them separately?
Here is a simple solution using base graphics:
scores <- rnorm(200, mean=12, sd=2)
gender <- gl(2, 100, labels=c("M","F"))   # 200 labels, matching the 200 scores
opar <- par(mfrow=c(1,2))
for (g in levels(gender))
  qqnorm(scores[gender==g], main=paste("Gender =", g))
par(opar)
A more elegant lattice solution is then:
library(lattice)
qqmath(~ scores | gender, data=data.frame(scores, gender), type=c("p", "g"))
See the on-line help for qqmath for more discussion and example of possible customization.
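If you prefer ggplot2, a similar faceted Q-Q plot is possible; a sketch, assuming ggplot2 is installed and reusing the scores and gender objects from above:
library(ggplot2)
ggplot(data.frame(scores, gender), aes(sample = scores)) +
  stat_qq() + stat_qq_line() +     # Q-Q points and reference line
  facet_wrap(~ gender)             # one panel per gender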
In Python, there is a QQ-plot method offered by the OpenTURNS library; see the doc here. Here is an example.
In a first step, we generate a random sample of size 300 from a Uniform distribution.
In a second step, we consider that we do not know where this sample comes from and try to fit a Normal distribution and a Uniform distribution.
In a third step, we draw the QQ-plot of the sample against each of the fitted distributions in order to "see" which one is the best.
1st step:
import openturns as ot
from openturns.viewer import View
distribution = ot.Uniform(-1, 1)
sample = distribution.getSample(300)
2nd step:
fitted_normal = ot.NormalFactory().build(sample)
fitted_uniform = ot.UniformFactory().build(sample)
3rd step:
QQ_plot1 = ot.VisualTest.DrawQQplot(sample, fitted_normal)
QQ_plot2 = ot.VisualTest.DrawQQplot(sample,fitted_uniform)
View(QQ_plot1)
View(QQ_plot2)
As expected, the fitted Uniform is better adapted to the sample than the Normal, which has bigger errors at both ends of the domain.
