Obtaining quantiles from a complete Gaussian fit of data in R

I have been struggling to understand how R calculates quantiles and how it fits a normal distribution to data.
I have data (NDVI values) that follow a truncated normal distribution (see figure).
I am interested in getting the lowest 10th-percentile value (p=0.1), both from the data themselves and from the fitted normal distribution curve.
In my understanding, because the data are truncated, the two should be quite different: I expect the quantile from the data to be higher than the one calculated from the normal distribution, but this is not so. From what I understand of the quantile function's help page, the quantile from the data should come from the default quantile call:
q=quantile(y, p=0.1)
while the quantile from the normal distribution is:
qx=quantile(y, p=0.1, type=9)
However, the two come out very close in all cases, which makes me wonder what type of distribution R fits to the data to calculate the quantile (a truncated normal?).
I have also tried to calculate the quantile from the fitted normal curve:
fitted=fitdist(as.numeric(y), "norm", discrete = T)
fit.q=as.numeric(quantile(fitted, p=0.1)[[1]][1])
but I obtain no difference.
So my questions are:
To what curve does R fit the data when calculating quantiles, in particular for type=9? How can I calculate the quantile based on the complete normal distribution (including the lower tail)?
I don't know how to generate a reproducible example for this, but the data is available at https://dl.dropboxusercontent.com/u/26249349/data.csv
Thanks!

R uses the empirical ordering of the data when determining quantiles, rather than assuming any particular distribution. All nine types of the quantile function are based on order statistics and differ only in how they interpolate between them; type=9 uses plotting positions that are approximately unbiased when the data are normal, but it still does not fit a normal curve.
The 10th percentile of your truncated data and of a normal distribution fitted to your data happen to be pretty close, although the 1st percentiles differ quite a bit. For example:
# Load packages and data
library(fitdistrplus)
df = read.csv("data.csv", header=TRUE, stringsAsFactors=FALSE)
# Fit a normal distribution to the data
df.dist = fitdist(df$x, "norm", discrete = TRUE)
Now let's get quantiles of the fitted distribution and the original data. I've included the 1st percentile in addition to the 10th percentile. You can see that the fitted normal distribution's 10th percentile is just a bit lower than that of the data. However, the 1st percentile of the fitted normal distribution is much lower.
quantile(df.dist, p=c(0.01, 0.1))
Estimated quantiles for each specified probability (non-censored data)
p=0.01 p=0.1
estimate 1632.829 2459.039
quantile(df$x, p=c(0.01, 0.1))
1% 10%
2064.79 2469.90
quantile(df$x, p=c(0.01, 0.1), type=9)
1% 10%
2064.177 2469.400
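To confirm that the type argument only changes how quantile interpolates between order statistics (type=9 included), rather than fitting any distribution, a quick check is to compare all nine types; for a sample this size the values are nearly identical:
# All nine quantile types interpolate the ordered data differently,
# but none of them fits a distribution to it
sapply(1:9, function(t) quantile(df$x, probs=0.1, type=t))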
You can also see this by direct ranking of the data and by getting the 1st and 10th percentiles of a normal distribution with mean and sd equal to the fitted values from fitdist:
# 1st and 10th percentiles of data by direct ranking
df$x[order(df$x)][round(c(0.01,0.1)*5780)]  # 5780 is the number of observations
[1] 2064 2469
# 1st and 10th percentiles of fitted distribution
qnorm(c(0.01,0.1), df.dist$estimate[1], df.dist$estimate[2])
[1] 1632.829 2459.039
Let's plot histograms of the original data (blue) and of fake data generated from the fitted normal distribution (red). The area of overlap is purple.
# Histogram of data (blue)
hist(df$x, xlim=c(0,8000), ylim=c(0,1600), col="#0000FF80")
# Overlay histogram of random draws from fitted normal distribution (red)
set.seed(685)
x.fit = rnorm(length(df$x), df.dist$estimate[1], df.dist$estimate[2])
hist(x.fit, add=TRUE, col="#FF000080")
Or we can plot the empirical cumulative distribution function (ecdf) for the data (blue) and the random draws from the fitted normal distribution (red). The horizontal grey line marks the 10th percentile:
plot(ecdf(df$x), xlim=c(0,8000), col="blue")
lines(ecdf(x.fit), col="red")
abline(0.1,0, col="grey40", lwd=2, lty="11")
Now that I've gone through this, I'm wondering if you were expecting fitdist to return the parameters of the normal distribution we would have gotten had your data really come from a normal distribution and not been truncated. Rather, fitdist returns a normal distribution with the mean and sd of the (truncated) data at hand, so the distribution returned by fitdist is shifted to the right compared to where we might have "expected" it to be.
c(mean=mean(df$x), sd=sd(df$x))
mean sd
3472.4708 790.8538
df.dist$estimate
mean sd
3472.4708 790.7853
Or, another quick example: x is normally distributed with mean ~ 0 and sd ~ 1. xtrunc removes all values less than -1, and xtrunc.dist is the output of fitdist on xtrunc:
set.seed(55)
x = rnorm(6000)
xtrunc = x[x > -1]
xtrunc.dist = fitdist(xtrunc, "norm")
round(cbind(sapply(list(x=x,xtrunc=xtrunc), function(x) c(mean=mean(x),sd=sd(x))),
xtrunc.dist=xtrunc.dist$estimate),3)
x xtrunc xtrunc.dist
mean -0.007 0.275 0.275
sd 1.009 0.806 0.806
And you can see in the ecdf plot below that the truncated data and the normal distribution fitted to the truncated data have about the same 10th percentile, while the 10th percentile of the untruncated data is (as we would expect) shifted to the left.
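If the goal is instead to recover the parameters of the complete (untruncated) normal distribution, which is what the original question seems to be after, one option is to fit a truncated normal by maximum likelihood with the truncation point held fixed. A minimal sketch, assuming the truncnorm package is installed and that the truncation point a = -1 is known:
# Fit a truncated normal to the truncated sample; a is fixed at the known cutoff
library(truncnorm)  # provides dtruncnorm/ptruncnorm for fitdist
xtrunc.tn = fitdist(xtrunc, "truncnorm", fix.arg=list(a=-1),
                    start=list(mean=mean(xtrunc), sd=sd(xtrunc)))
xtrunc.tn$estimate  # should be close to the untruncated mean 0 and sd 1
# 10th percentile of the complete normal, lower tail included
qnorm(0.1, xtrunc.tn$estimate[1], xtrunc.tn$estimate[2])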

Related

95% confidence interval for smooth.spline in R

I have used smooth.spline to estimate a cubic spline for my data. But when I calculate the 90% point-wise confidence interval using the equation, the results seem to be a little bit off. Can someone please tell me if I did it wrong? I am also wondering if there is a function that can automatically calculate a point-wise confidence band for a smooth.spline fit.
boneMaleSmooth = smooth.spline( bone[males,"age"], bone[males,"spnbmd"], cv=FALSE)
error90_male = qnorm(.95)*sd(boneMaleSmooth$x)/sqrt(length(boneMaleSmooth$x))
plot(boneMaleSmooth, ylim=c(-0.5,0.5), col="blue", lwd=3, type="l", xlab="Age",
ylab="Relative Change in Spinal BMD")
points(bone[males,c(2,4)], col="blue", pch=20)
lines(boneMaleSmooth$x,boneMaleSmooth$y+error90_male, col="purple",lty=3,lwd=3)
lines(boneMaleSmooth$x,boneMaleSmooth$y-error90_male, col="purple",lty=3,lwd=3)
Because I am not sure I did it correctly, I also used the gam() function from the mgcv package.
It instantly gave a confidence band, but I am not sure if it is a 90% or 95% CI or something else. It would be great if someone could explain.
males=gam(bone[males,c(2,4)]$spnbmd ~s(bone[males,c(2,4)]$age), method = "GCV.Cp")
plot(males,xlab="Age",ylab="Relative Change in Spinal BMD")
I'm not sure smooth.spline fits have "nice" confidence intervals like those from lowess do. But I found a code sample from a CMU Data Analysis course that makes Bayesian bootstrap confidence intervals.
Here are the functions used and an example. The main function is spline.cis, whose first parameter is a data frame where the first column holds the x values and the second column the y values. The other important parameter is B, which indicates the number of bootstrap replications to do. (See the linked PDF above for the full details.)
# Helper functions
resampler <- function(data) {
  n <- nrow(data)
  resample.rows <- sample(1:n, size=n, replace=TRUE)
  return(data[resample.rows,])
}
spline.estimator <- function(data, m=300) {
  fit <- smooth.spline(x=data[,1], y=data[,2], cv=TRUE)
  eval.grid <- seq(from=min(data[,1]), to=max(data[,1]), length.out=m)
  return(predict(fit, x=eval.grid)$y) # We only want the predicted values
}
spline.cis <- function(data, B, alpha=0.05, m=300) {
  spline.main <- spline.estimator(data, m=m)
  # Bootstrap replicates of the fitted curve
  spline.boots <- replicate(B, spline.estimator(resampler(data), m=m))
  # Basic ("pivotal") bootstrap intervals: 2*estimate - bootstrap quantiles
  cis.lower <- 2*spline.main - apply(spline.boots, 1, quantile, probs=1-alpha/2)
  cis.upper <- 2*spline.main - apply(spline.boots, 1, quantile, probs=alpha/2)
  return(list(main.curve=spline.main, lower.ci=cis.lower, upper.ci=cis.upper,
              x=seq(from=min(data[,1]), to=max(data[,1]), length.out=m)))
}
# Sample data
data <- data.frame(x=rnorm(100), y=rnorm(100))
# Run and plot
sp.cis <- spline.cis(data, B=1000, alpha=0.05)
plot(data[,1], data[,2])
lines(x=sp.cis$x, y=sp.cis$main.curve)
lines(x=sp.cis$x, y=sp.cis$lower.ci, lty=2)
lines(x=sp.cis$x, y=sp.cis$upper.ci, lty=2)
And that gives something like
Actually, it looks like there might be a more parametric way to calculate confidence intervals, using the jackknife residuals. This code comes from the S+ help page for smooth.spline:
fit <- smooth.spline(data$x, data$y) # smooth.spline fit
res <- (fit$yin - fit$y)/(1-fit$lev) # jackknife residuals
sigma <- sqrt(var(res)) # estimate sd
upper <- fit$y + 2.0*sigma*sqrt(fit$lev) # upper 95% conf. band
lower <- fit$y - 2.0*sigma*sqrt(fit$lev) # lower 95% conf. band
matplot(fit$x, cbind(upper, fit$y, lower), type="plp", pch=".")
And that results in
And as far as the gam confidence intervals go, if you read the plot.gam help file (the bands are drawn by plot(), not print()), there is an se= parameter with default TRUE, and the docs say
when TRUE (default) upper and lower lines are added to the 1-d plots at 2 standard errors above and below the estimate of the smooth being plotted while for 2-d plots, surfaces at +1 and -1 standard errors are contoured and overlayed on the contour plot for the estimate. If a positive number is supplied then this number is multiplied by the standard errors when calculating standard error curves or surfaces. See also shade, below.
So you can adjust the width of the band by adjusting this parameter in the plot() call.
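For example, a minimal sketch of an approximate 90% pointwise band, assuming the males gam fit from the question (qnorm(0.95), about 1.645, replaces the default multiplier of 2):
# Multiply the standard errors by ~1.645 instead of the default 2
plot(males, se=qnorm(0.95), xlab="Age", ylab="Relative Change in Spinal BMD")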
The R package mgcv calculates smoothing splines and Bayesian "confidence intervals." These are not confidence intervals in the usual (frequentist) sense, but numerical simulations have shown that there is almost no difference; see the linked paper by Marra and Wood in the help file of mgcv.
library(SemiPar)
data(lidar)
require(mgcv)
fit=gam(range~s(logratio), data = lidar)
plot(fit)
with(lidar, points(logratio, range-mean(range)))

Plot a quadratic relation for a predictor of a Cox regression with R

I need to plot the relative risk for a quadratic effect in a Cox regression. My model looks like this:
library(survival)
cox_mod <- coxph(Surv(time, status) ~
                   ph.karno + pat.karno + meal.cal + meal.cal_q,
                 data = lung)
Where meal.cal_q is defined as:
lung$meal.cal_q <- lung$meal.cal^2
The plot should use the coefficients of meal.cal and meal.cal_q and show the relative risk on the y-axis and the meal.cal values on the x-axis. The relative risk should be defined as the risk at a given meal.cal value compared to the risk with all of the predictors at their means. Additionally, the plot should include the 95% confidence intervals. The plot should look something like this:
Expected plot
If possible, the plot should be a ggplot object so that I can customize it.
I have been reading for hours on the web but cannot figure out how to make the described plot, and I hope someone can help me. I tried, for example, the predict() function:
meal.cal_new <- seq(min(lung$meal.cal, na.rm= TRUE), max(lung$meal.cal, na.rm= TRUE), by= 1)
meal.cal_q_new <- meal.cal_new^2
n <- length(meal.cal_new)
lung_new <- data.frame(ph.karno = rep(mean(lung$ph.karno, na.rm=TRUE), n),
                       pat.karno = rep(mean(lung$pat.karno, na.rm=TRUE), n),
                       meal.cal = meal.cal_new,
                       meal.cal_q = meal.cal_q_new)
predicted_rel_risk <- predict(cox_mod, lung_new, interval = "confidence")
print(predicted_rel_risk)
Firstly, the predicted values do not include the 95% confidence intervals. In addition, there are negative values in predicted_rel_risk, which in my opinion should not be the case, since the minimal relative risk should be zero.
Therefore I cannot get the desired plot. So all I can do is this:
lung_new$predicted_rel_risk <- predicted_rel_risk
ggplot(lung_new, aes(meal.cal, predicted_rel_risk)) +
geom_smooth(se= TRUE)
The resulting plot does not include the confidence intervals and shows negative relative risk. Here is what I get:
Thank you a lot in advance!
The prediction includes negative values because predict() returns the linear predictor by default; you did not specify that you want the relative risk. (Note also that predict.coxph has no interval argument, so interval = "confidence" is silently ignored.) Try the following code:
predicted_rel_risk <- predict(cox_mod, lung_new, type = "risk")
This gives you the following plot:
Plot without negative values
In order to get the confidence intervals as well, you can use bootstrapping. To put it short, this means that a random sample will be drawn from your data (with replacement) and the relative risk will be calculated from it. This procedure is repeated, for example, 10,000 times, which gives you 10,000 relative risk values for each value of your predictor. You get the main line for your plot by calculating the mean relative risk for each value of your predictor. To get the confidence interval, order the relative risks from smallest to greatest for each value of your predictor: the 250th (9,750th) value gives you your lower (upper) CI bound. A sketch of this procedure follows.
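A minimal sketch of that procedure, assuming the lung_new data frame built in the question (1,000 replications here instead of 10,000 to keep the run short, so the band comes from the 25th and 975th ordered values, i.e. the 2.5% and 97.5% quantiles):
library(survival)
library(ggplot2)
set.seed(1)
boot_risk <- replicate(1000, {
  idx <- sample(nrow(lung), replace=TRUE)  # resample rows with replacement
  mod <- coxph(Surv(time, status) ~ ph.karno + pat.karno + meal.cal + meal.cal_q,
               data=lung[idx, ])
  predict(mod, lung_new, type="risk")      # relative risk at each meal.cal value
})
lung_new$fit <- rowMeans(boot_risk)                        # main curve
lung_new$lwr <- apply(boot_risk, 1, quantile, probs=0.025) # lower 95% bound
lung_new$upr <- apply(boot_risk, 1, quantile, probs=0.975) # upper 95% bound
ggplot(lung_new, aes(meal.cal, fit)) +
  geom_line() +
  geom_ribbon(aes(ymin=lwr, ymax=upr), alpha=0.3)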
Hope this helps.

Altering distribution of one dataset to match another dataset

I have 2 datasets, one of modeled (artificial) data and another with observed data. They have slightly different statistical distributions and I want to force the modeled data to match the observed data distribution in the spread of the data. In other words, I need the modeled data to better represent the tails of the observed data. Here's an example.
model <- c(37.50,46.79,48.30,46.04,43.40,39.25,38.49,49.51,40.38,36.98,40.00,
38.49,37.74,47.92,44.53,44.91,44.91,40.00,41.51,47.92,36.98,43.40,
42.26,41.89,38.87,43.02,39.25,40.38,42.64,36.98,44.15,44.91,43.40,
49.81,38.87,40.00,52.45,53.13,47.92,52.45,44.91,29.54,27.13,35.60,
45.34,43.37,54.15,42.77,42.88,44.26,27.14,39.31,24.80,16.62,30.30,
36.39,28.60,28.53,35.84,31.10,34.55,52.65,48.81,43.42,52.49,38.00,
38.65,34.54,37.70,38.11,43.05,29.95,32.48,24.63,35.33,41.34)
observed <- c(39.50,44.79,58.28,56.04,53.40,59.25,48.49,54.51,35.38,39.98,28.00,
28.49,27.74,51.92,42.53,44.91,44.91,40.00,41.51,47.92,36.98,53.40,
42.26,42.89,43.87,43.02,39.25,40.38,42.64,36.98,44.15,44.91,43.40,
52.81,36.87,47.00,52.45,53.13,47.92,52.45,44.91,29.54,27.13,35.60,
51.34,43.37,51.15,42.77,42.88,44.26,27.14,39.31,24.80,12.62,30.30,
34.39,25.60,38.53,35.84,31.10,34.55,52.65,48.81,43.42,52.49,38.00,
34.65,39.54,47.70,38.11,43.05,29.95,22.48,24.63,35.33,41.34)
summary(model)
Min. 1st Qu. Median Mean 3rd Qu. Max.
16.62 36.98 40.38 40.28 44.91 54.15
summary(observed)
Min. 1st Qu. Median Mean 3rd Qu. Max.
12.62 35.54 42.58 41.10 47.76 59.2
How can I force the model data to have the variability that the observed has in R?
Are you just modeling the distribution of observed? If so, you could generate a kernel density estimate from the observations and then resample from that modeled density distribution. For example:
library(ggplot2)
First we generate a density estimate from the observed values. This is our model of the distribution of the observed values. adjust is a parameter that determines the bandwidth. The default value is 1. Smaller values result in less smoothing (i.e., a density estimate that more closely follows small-scale structure in the data):
dens.obs = density(observed, adjust=0.8)
Now, resample from the density estimate to get the modeled values. We set prob=dens.obs$y so that the probability of a value in dens.obs$x being chosen is proportional to its modeled density.
set.seed(439)
resample.obs = sample(dens.obs$x, 1000, replace=TRUE, prob=dens.obs$y)
Put observed and modeled values in a data frame in preparation for plotting:
dat = data.frame(value=c(observed,resample.obs),
group=rep(c("Observed","Modeled"), c(length(observed),length(resample.obs))))
The ECDF (empirical cumulative distribution function) plot below shows that sampling from the kernel density estimate gives samples with a distribution similar to the observed data:
ggplot(dat, aes(value, fill=group, colour=group)) +
stat_ecdf(geom="step") +
theme_bw()
You can also plot the density distribution of the observed data and the values sampled from the modeled distribution (using the same value for the adjust parameter as we used above).
ggplot(dat, aes(value, fill=group, colour=group)) +
geom_density(alpha=0.4, adjust=0.8) +
theme_bw()
Have a look at this answer: How to generate distributions given mean, SD, skew and kurtosis in R?
It discusses the SuppDists package, which lets you create a distribution by fitting a set of parameters based on the Johnson system of distributions.
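A minimal sketch of that idea, assuming the SuppDists package is installed and using the observed vector from above (JohnsonFit picks a Johnson family from sample quantiles; rJohnson then draws new values from the fitted distribution):
library(SuppDists)
parms <- JohnsonFit(observed, moment="quant")  # fit Johnson parameters from sample quantiles
matched <- rJohnson(1000, parms)               # new draws matching the observed distribution
summary(matched)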

What are quantiles in ggplot stat_quantile?

Here is my reproducible data:
library("ggplot2")
library("ggplot2movies")
library("quantreg")
set.seed(2154)
msamp <- movies[sample(nrow(movies), 1000), ]
I am trying to become acquainted with stat_quantile but the example from the documentation raises a couple of questions.
mggp <- ggplot(data=msamp, mapping=aes(x=year, y=rating)) +
geom_point() +
stat_quantile(formula=y~x, quantiles=c(0, 0.25, 0.50, 0.75, 1)) +
theme_classic(base_size = 12) +
ylim(c(0,10))
mggp
To my understanding, quantiles split the data into parts that are smaller than the defined cut-off values, correct? If I define quantiles as in the code above, I get five lines. Why? What do they represent?
It also seems that the quantiles are calculated based on the dependent variable on the y-axis (rating). Is it possible to reverse this? I mean, to split the data based on quantiles of year?
This function performs quantile regression, and each line is an estimate of the corresponding conditional quantile of the response.
From Wikipedia:
Quantile regression is a type of regression analysis used in statistics and econometrics. Whereas the method of least squares results in estimates that approximate the conditional mean of the response variable given certain values of the predictor variables, quantile regression aims at estimating either the conditional median or other quantiles of the response variable.
Thus each of the five lines corresponds to one of the five quantiles you requested (0, 0.25, 0.50, 0.75, 1): for example, the 0.50 line estimates the conditional median of rating given year.
You can find a detailed technical discussion in the vignette of the quantreg package.
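For reference, stat_quantile fits these lines with quantreg's rq() behind the scenes. A minimal sketch of the equivalent direct call (only the interior quantiles here, since taus of 0 and 1 are degenerate extremes):
library(quantreg)
fits <- rq(rating ~ year, data=msamp, tau=c(0.25, 0.50, 0.75))
coef(fits)  # one intercept/slope pair per tau, i.e. one fitted line each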

Advice on calculating a function to describe upper bound of data

I have a scatter plot of a dataset and I am interested in calculating the upper bound of the data. I don't know if this is a standard statistical approach, so what I was considering was splitting the X-axis data into small ranges, calculating the max for these ranges, and then trying to identify a function to describe these points. Is there already a function in R to do this?
If it's relevant there are 92611 points.
You might like to look into quantile regression, which is available in the quantreg package. Whether this is useful will depend on whether you want the absolute maximum within your "windows" or whether some extreme quantile, say the 95th or 99th, is acceptable. If you are not familiar with quantile regression, consider ordinary linear regression, which fits a model for the expectation (the mean response) conditional upon the model covariates. Quantile regression for the middle quantile (0.5) would instead fit a model to the median response, conditional upon the model covariates.
Here is an example using the quantreg package, to show you what I mean. First, generate some dummy data similar to the data you show:
set.seed(1)
N <- 5000
DF <- data.frame(Y = rev(sort(rlnorm(N, -0.9))) + rnorm(N),
X = seq_len(N))
plot(Y ~ X, data = DF)
Next, fit the model to the 99th percentile (or the 0.99 quantile):
mod <- rq(Y ~ log(X), data = DF, tau = .99)
To generate the "fitted line", we predict from the model at 100 equally spaced values in X
pDF <- data.frame(X = seq(1, 5000, length = 100))
pDF <- within(pDF, Y <- predict(mod, newdata = pDF))
and add the fitted model to the plot:
lines(Y ~ X, data = pDF, col = "red", lwd = 2)
This should give you a plot like this:
I would second Gavin's nomination of quantile regression. Your data might be simulated with your X and Y each log-normally distributed. You can see what a plot of the joint distribution of two independent log-normal variates looks like (no imposed correlation, though not necessarily cor(x,y)==0 in a finite sample) if you run:
x <- rlnorm(1000, log(300), sdlog=1)
y<- rlnorm(1000, log(7), sdlog=1)
plot(x,y, cex=0.3)
You might consider looking at their individual distributions with qqplot (in the base plotting functions), remembering that the tails of such distributions can behave in surprising ways. You should be more interested in how well the bulk of the values fits a particular distribution than in the extremes ... unless of course your applications are in finance or insurance. We don't want another global financial crisis because of poor modeling assumptions about tail behavior, now do we?
qqplot(x, rlnorm(10000, log(300), sdlog=1) )
