How do I test the validity of my normal distribution quantitatively?

I have tried to determine the validity of a normal distribution which I obtained in the following way:
xfit<-seq(min(x),max(x),length=1000)
yfit<-dnorm(xfit,mean=mean(x),sd=sd(x))
In this excerpt of my code x is an array with ~300 values between -15 and -5.
I evaluated the distribution by comparing it to a histogram and a Q-Q plot in the following way:
#Fitting in histogram with normal curve
x <- dataM31
h<-hist(x, breaks=10, col="red", xlab="Absolute Magnitude", prob = TRUE)
xfit<-seq(min(x),max(x),length=1000)
yfit<-dnorm(xfit,mean=mean(x),sd=sd(x))
lines(xfit, yfit, col="blue", lwd=2)
#Create a normality test plot (qq-plot)
x_norm <- (x - mean(x))/sd(x)
qqnorm(x_norm); abline(0,1)
I would like to perform some kind of quantitative test of how well this normal distribution fits, but I cannot find a function that compares a 1D array of data against a fitted normal curve (the xfit/yfit pair above). Is there a function that can do this with relative ease, or should I completely rewrite my code to get this to work?

These functions cover normality testing and density estimation; run the following in the console and see the examples:
?shapiro.test
?ks.test
?density
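For instance, a minimal sketch applied to the asker's vector x (note that ks.test with parameters estimated from the same data gives an optimistic p-value; a Lilliefors-corrected test such as nortest::lillie.test addresses this):
shapiro.test(x)                        # Shapiro-Wilk test of normality
ks.test(x, "pnorm", mean(x), sd(x))    # Kolmogorov-Smirnov vs. the fitted normal
# Caveat: estimating mean/sd from the same data makes the KS p-value optimistic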

Related

R generalized hyperbolic distribution plot line in hist

I have the following code:
x.grid <- seq(-0.05,to=0.05,by=0.001)
hist(df$dataset.d.return.aex, breaks=100, main='daily returns')
Which results in the following plot:
[histogram of the daily returns]
Now, I want to draw the implied marginal distributions in histograms of the daily returns, where I use a generalized hyperbolic distribution. I tried to do it with the following code:
ghypuv <- fit.ghypuv(data = returns[,"d.return.aex"], symmetric = TRUE)
lines(ghypuv, col="blue")
However, this results in the following plot:
So, my question is: how do I properly draw the implied marginal distributions (using a generalized hyperbolic distribution) in histograms of the daily returns?
Try hist(..., freq=FALSE) to use density rather than number of counts on the y-axis.
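A minimal sketch of the combined calls, reusing the question's own objects (df, returns) and assuming the ghyp package's fit.ghypuv() and its lines() method for fitted objects:
library(ghyp)
# Draw the histogram on the density scale so the fitted curve is visible
hist(df$dataset.d.return.aex, breaks = 100, freq = FALSE, main = "daily returns")
ghypuv <- fit.ghypuv(data = returns[, "d.return.aex"], symmetric = TRUE)
lines(ghypuv, col = "blue")  # overlay the fitted generalized hyperbolic density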
(This is probably a duplicate ...)

95% confidence interval for smooth.spline in R [duplicate]

I have used smooth.spline to estimate a cubic spline for my data. But when I calculate the 90% point-wise confidence interval using the usual standard-error formula, the results seem to be a little off. Can someone please tell me if I did it wrong? I am also wondering if there is a function that can automatically calculate a point-wise confidence band for a smooth.spline fit.
boneMaleSmooth = smooth.spline( bone[males,"age"], bone[males,"spnbmd"], cv=FALSE)
error90_male = qnorm(.95)*sd(boneMaleSmooth$x)/sqrt(length(boneMaleSmooth$x))
plot(boneMaleSmooth, ylim=c(-0.5,0.5), col="blue", lwd=3, type="l", xlab="Age",
ylab="Relative Change in Spinal BMD")
points(bone[males,c(2,4)], col="blue", pch=20)
lines(boneMaleSmooth$x,boneMaleSmooth$y+error90_male, col="purple",lty=3,lwd=3)
lines(boneMaleSmooth$x,boneMaleSmooth$y-error90_male, col="purple",lty=3,lwd=3)
Because I was not sure whether I had done it correctly, I also tried the gam() function from the mgcv package.
It instantly gave a confidence band, but I am not sure whether it is a 90% CI, a 95% CI, or something else. It would be great if someone could explain.
males=gam(bone[males,c(2,4)]$spnbmd ~s(bone[males,c(2,4)]$age), method = "GCV.Cp")
plot(males,xlab="Age",ylab="Relative Change in Spinal BMD")
I'm not sure smooth.spline fits have "nice" confidence intervals like those from lowess do. But I found a code sample from a CMU Data Analysis course that builds Bayesian bootstrap confidence intervals.
Here are the functions used and an example. The main function is spline.cis, whose first parameter is a data frame whose first column holds the x values and second column the y values. The other important parameter is B, which indicates the number of bootstrap replications to do. (See the linked PDF above for the full details.)
# Helper functions
resampler <- function(data) {
  # Resample the rows of the data frame with replacement
  n <- nrow(data)
  resample.rows <- sample(1:n, size=n, replace=TRUE)
  return(data[resample.rows,])
}
spline.estimator <- function(data, m=300) {
  # Fit a smoothing spline and evaluate it on a regular grid of m points
  fit <- smooth.spline(x=data[,1], y=data[,2], cv=TRUE)
  eval.grid <- seq(from=min(data[,1]), to=max(data[,1]), length.out=m)
  return(predict(fit, x=eval.grid)$y) # We only want the predicted values
}
spline.cis <- function(data, B, alpha=0.05, m=300) {
  spline.main <- spline.estimator(data, m=m)
  # B bootstrap replicates of the spline, each refit on resampled rows
  spline.boots <- replicate(B, spline.estimator(resampler(data), m=m))
  # Basic (pivotal) bootstrap intervals: 2*estimate minus bootstrap quantiles
  cis.lower <- 2*spline.main - apply(spline.boots, 1, quantile, probs=1-alpha/2)
  cis.upper <- 2*spline.main - apply(spline.boots, 1, quantile, probs=alpha/2)
  return(list(main.curve=spline.main, lower.ci=cis.lower, upper.ci=cis.upper,
              x=seq(from=min(data[,1]), to=max(data[,1]), length.out=m)))
}
# Sample data
data <- data.frame(x=rnorm(100), y=rnorm(100))
# Run and plot
sp.cis <- spline.cis(data, B=1000, alpha=0.05)
plot(data[,1], data[,2])
lines(x=sp.cis$x, y=sp.cis$main.curve)
lines(x=sp.cis$x, y=sp.cis$lower.ci, lty=2)
lines(x=sp.cis$x, y=sp.cis$upper.ci, lty=2)
And that gives something like this: [plot: scatter of the data with the fitted spline and dashed bootstrap confidence bands]
Actually it looks like there might be a more parametric way to calculate confidence intervals using the jackknife residuals. This code comes from the S+ help page for smooth.spline
fit <- smooth.spline(data$x, data$y) # smooth.spline fit
res <- (fit$yin - fit$y)/(1-fit$lev) # jackknife residuals
sigma <- sqrt(var(res)) # estimate sd
upper <- fit$y + 2.0*sigma*sqrt(fit$lev) # upper 95% conf. band
lower <- fit$y - 2.0*sigma*sqrt(fit$lev) # lower 95% conf. band
matplot(fit$x, cbind(upper, fit$y, lower), type="plp", pch=".")
And that results in: [plot: the fitted spline with the upper and lower 95% confidence bands]
And as far as the gam confidence intervals go, if you read the plot.gam help file, there is an se= parameter, defaulting to TRUE, and the docs say
when TRUE (default) upper and lower lines are added to the 1-d plots at 2 standard errors above and below the estimate of the smooth being plotted while for 2-d plots, surfaces at +1 and -1 standard errors are contoured and overlayed on the contour plot for the estimate. If a positive number is supplied then this number is multiplied by the standard errors when calculating standard error curves or surfaces. See also shade, below.
So you can adjust the width of the band via this parameter. (This would go in the plot() call.)
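For instance, to widen the default 2-standard-error band to an approximate 95% pointwise band, one could pass a multiplier (a sketch using the asker's fitted model named males):
plot(males, se = qnorm(0.975),  # ~1.96 standard errors, an approximate 95% band
     xlab = "Age", ylab = "Relative Change in Spinal BMD")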
The R package mgcv calculates smoothing splines and Bayesian "confidence intervals." These are not confidence intervals in the usual (frequentist) sense, but numerical simulations have shown that there is almost no difference; see the linked paper by Marra and Wood in the help file of mgcv.
library(SemiPar)
data(lidar)
require(mgcv)
fit=gam(range~s(logratio), data = lidar)
plot(fit)
with(lidar, points(logratio, range-mean(range)))
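If you need the band numerically rather than only on the plot, predict.gam can return pointwise standard errors; a minimal sketch along the lines of the band plot() draws (for this Gaussian, identity-link fit the response and link scales coincide):
pr <- predict(fit, se.fit = TRUE)   # fitted values and pointwise standard errors
upper <- pr$fit + 2 * pr$se.fit     # roughly the band drawn by plot(fit)
lower <- pr$fit - 2 * pr$se.fit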

Fitting a lognormal or Poisson distribution

I have a vector of 1096 numbers: the daily average concentration of NOx measured over 3 years at a measurement station.
You can observe the type of distribution in the image:
I used these commands to do the histogram:
NOxV<-scan("NOx_Vt15-17.txt")
hist.NOxVt<-hist(NOxV, plot = FALSE, breaks = 24)
plot(hist.NOxVt, xlab = "[NOx]", ylab = "Frequenze assolute", main = "Istogramma freq. ass. NOx 15-17 Viterbo")
points(hist.NOxVt$mids, hist.NOxVt$counts, col= "red")
My professor suggested that I fit the histogram with a Poisson distribution, paying attention to the transition from discrete to continuous (I don't know what that means), or with a lognormal distribution.
I tried to do the Poisson fit with some commands that she gave us in class, but R threw an error after executing the last line of the following:
my_poisson = function(params, x){
  exp(-params)*params^x/factorial(x)
}
y <- hist.NOxVt$counts/1096
x <- hist.NOxVt$mids
z <- nls( y ~ exp(-a)*a^x/factorial(x), start=list(a=1) )
Error in numericDeriv(form[[3L]], names(ind), env) :
  Missing value or an infinity produced when evaluating the model
In addition: There were 50 or more warnings (use warnings() to see the first 50)
After failing to solve this problem (even after searching for similar issues online), I decided to fit the distribution with a lognormal instead, but I have no idea how to do that: the professor did not explain it to us, and I don't yet have enough R experience to figure it out on my own.
I would appreciate any suggestion or examples of how to do a lognormal fit and/or Poisson fit.
There's a built-in function fitdistr in the MASS package that comes with R:
Generating a data example to look at (eyeballing parameters to get something similar to your picture):
set.seed(101)
z <- rlnorm(1096,meanlog=4.5,sdlog=0.8)
Fit the distributions. (On statistical grounds I wouldn't recommend a Poisson fit: it might be possible to adapt a discrete distribution such as the Poisson (or, better, the negative binomial) to fit such continuous data, but log-Normal or Gamma distributions are more natural choices.)
library(MASS)
f1 <- fitdistr(z,"lognormal")
f2 <- fitdistr(z,"Gamma")
The f1 and f2 objects, when printed, give the estimated coefficients (meanlog and sdlog for log-Normal, shape and rate for Gamma) and standard errors of the coefficients.
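As an aside (not in the original answer), fitdistr objects carry a log-likelihood, so the two candidates can also be compared numerically:
logLik(f1); logLik(f2)   # log-likelihoods of the two fits
AIC(f1)                  # log-Normal
AIC(f2)                  # Gamma; the lower AIC marks the better-fitting candidate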
Draw a picture (on the density scale, not the count scale): red is log-Normal, blue is Gamma (in this case log-Normal fits better because that's how I generated the "data" in the first place). [The with(as.list(coef(...)), ...) construct is some R fanciness that allows the use of the names of the coefficients (meanlog, sdlog, etc.) in the subsequent R code.]
hist(z, col="gray", breaks=50, freq=FALSE)
with(as.list(coef(f1)),
     curve(dlnorm(x, meanlog, sdlog), add=TRUE, col="red", lwd=2))
with(as.list(coef(f2)),
     curve(dgamma(x, shape=shape, rate=rate), add=TRUE, col="blue", lwd=2))

Specificity of ROC curve plotting in reverse direction

I wish to plot the ROC curve for an SVM classifier I have built, but when I plot my data, the x axis (specificity) runs from 1.0 to -1.0; see the image below.
In order to plot this I used the following:
> plot(roc(predictor = fit.down.Kernel$pred$Overshooting, response = fit.down.Kernel$pred$obs))
where fit.down.Kernel is my model, Overshooting is the target feature I wish to predict.
Obviously I have gone about this the wrong way; can anyone point me in the right direction, please?
Ultimately I have a bunch of models which I have trained using a variety of different datasets (upsampled, downsampled...) and I wish to visually compare their performance using the ROC curve. I guess I need to get the axis working properly before proceeding to multiple plots.
You can use the ROCR package in R. Refer to the code below and adapt it to your predictions vs. actual results.
prob.mod1, prob.mod2 and prob.mod3 are predictions from three models, and y.test is your actual Overshooting values.
Use the prediction() function from ROCR:
prediction.mod1 <- prediction(prob.mod1, y.test)
prediction.mod2 <- prediction(prob.mod2, y.test)
prediction.mod3 <- prediction(prob.mod3, y.test)
Calculate the AUC:
auc.mod1 <- performance(prediction.mod1, "auc")@y.values
auc.mod2 <- performance(prediction.mod2, "auc")@y.values
auc.mod3 <- performance(prediction.mod3, "auc")@y.values
Plot the ROC curves (the "auc" performance objects hold a single number and are not plottable; use "tpr"/"fpr" performance objects for the curves):
roc.mod1 <- performance(prediction.mod1, "tpr", "fpr")
roc.mod2 <- performance(prediction.mod2, "tpr", "fpr")
roc.mod3 <- performance(prediction.mod3, "tpr", "fpr")
plot(roc.mod1)
plot(roc.mod2, col = 2, add = TRUE)
plot(roc.mod3, col = 3, add = TRUE)
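Alternatively, staying with pROC (which the question already uses), its plot method has a legacy.axes argument that relabels the x axis as 1 - specificity running from 0 to 1; a minimal sketch against the asker's model:
library(pROC)
r <- roc(predictor = fit.down.Kernel$pred$Overshooting,
         response  = fit.down.Kernel$pred$obs)
plot(r, legacy.axes = TRUE)  # x axis becomes 1 - specificity, running 0 -> 1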

Fitting a distribution graphically

I am running some tests to try to determine what distribution my data follows. From the look of its density, I thought it resembled a logistic distribution, so I used the MASS package to estimate the distribution's parameters. However, when I graph them together, the logistic, although better than the normal, is still not a very good fit. Is there a way to find which distribution would fit better? Thank you for the help!
library(quantmod)
getSymbols("^NDX",src="yahoo", from='1997-6-01', to='2012-6-01')
daily<- allReturns(NDX) [,c('daily')]
dailySerieTemporel<-ts(data=daily)
x<-na.omit(dailySerieTemporel)
library(MASS)
(xFit <- fitdistr(x, "logistic"))
#      location         scale
#   0.0005210570   0.0106366354
#  (0.0002941922) (0.0001444678)
xFitEst<-coef(xFit)
plot(density(x))
set.seed(125)
lines(density(rlogis(length(x), xFitEst['location'], xFitEst['scale'])), col=3)
lines(density(rnorm(length(x), mean(x), sd(x))), col=2)
This is elementary R: plot() creates a new plotting canvas by default, and you should use a command such as lines() to add to an existing plot.
This works for your example:
plot(density(x))
lines(density(rlogis(length(x), location = 0.0005210570,
scale = 0.0106366354)), col="blue")
as it adds the estimated logistic fit in blue to your existing plot.
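A hedged alternative to simulating with rlogis(): overlay the fitted logistic density directly with curve(), which avoids the Monte Carlo noise of random draws (reusing xFitEst from above):
plot(density(x))
curve(dlogis(x, location = xFitEst["location"], scale = xFitEst["scale"]),
      add = TRUE, col = "blue")  # the fitted density itself, drawn exactly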
