How to create a kernel density estimation with R?

I would like to program a kernel estimate (with Epanechnikov kernel^1 for example). I tried the following code^2 by putting the manual code (blue) and the default code (red) on the same figure (see attached) but it always gives a difference between the two density curves!
1: The analytic form of the Epanechnikov kernel is:
kappa(u) = (1-u^2), support |u| <=1, with u = (x-x_{i})/h.
2: My trial code:
x= faithful$eruptions
fit2 <- density(x, bw = 0.6, kernel = "epanechnikov")
xgrid = seq(-1, 8, 0.1)
kernelEpan <- function(x, obs, h) sum((1-((x-obs)/h)^2)*(abs(x-obs)<=h))/h
plot(xgrid, sapply(xgrid, FUN = kernelEpan, obs = faithful$eruptions, h = 0.6)/length(faithful$eruptions), type = "l", col = "blue")
lines(fit2, col = "red")

If you read the docs for bw in the density function, you will see:
bw : the smoothing bandwidth to be used. The kernels are scaled such that this is the standard deviation of the smoothing kernel.
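This means that, for your function's h parameter to match the behaviour of the bw parameter, you need to multiply h by sqrt(5): an Epanechnikov kernel supported on [-1, 1] has standard deviation 1/sqrt(5), so a kernel whose standard deviation is bw has half-width bw * sqrt(5).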
I would be tempted to vectorize your function, which allows you to normalize it accurately too:
kernelEpan <- function(xvals, obs, h) {
  h <- h * sqrt(5)
  dens <- sapply(xvals, function(x) {
    u <- abs(x - obs) / h
    u <- ifelse(u > 1, 1, u)
    sum(1 - u^2)
  })
  dens / sum(dens * mean(diff(xvals)))
}
This allows:
fit1 <- kernelEpan(xgrid, obs = faithful$eruptions, h = 0.6)
fit2 <- density(x, bw = 0.6, kernel = "epanechnikov")
plot(xgrid, fit1, type = "l", col = "blue")
lines(fit2, col = "red")
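The same match can also be obtained without numerical normalization by including the kernel's constant. The sketch below assumes the standard normalized Epanechnikov kernel K(u) = 3/4 * (1 - u^2) on |u| < 1 and the sqrt(5) bandwidth scaling described above; the name kde_epan is only illustrative. Since density() evaluates on its own grid using a binned approximation, small differences from the direct sum are expected.

```r
# Closed-form Epanechnikov KDE using density()'s bandwidth convention
# (kde_epan is a hypothetical helper name, not part of any package)
kde_epan <- function(xvals, obs, bw) {
  a <- bw * sqrt(5)                      # half-width of the kernel support
  sapply(xvals, function(x) {
    u <- (x - obs) / a
    # average of the normalized kernels, rescaled by the half-width
    mean(0.75 * (1 - u^2) * (abs(u) < 1)) / a
  })
}

obs    <- faithful$eruptions
xgrid  <- seq(-1, 8, by = 0.01)
manual <- kde_epan(xgrid, obs, bw = 0.6)

fit <- density(obs, bw = 0.6, kernel = "epanechnikov")
# interpolate density()'s output onto the same grid for comparison
builtin <- approx(fit$x, fit$y, xout = xgrid, yleft = 0, yright = 0)$y

sum(manual) * 0.01          # Riemann sum of the estimate, should be close to 1
max(abs(manual - builtin))  # should be small

plot(xgrid, manual, type = "l", col = "blue")
lines(fit, col = "red")
```

With the constant included, the estimate integrates to one by construction, so the result no longer depends on how fine or how wide the evaluation grid is.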

Related

Force GAM model fit to be monotonic and go through a fixed point (x0, y0) with R mgcv

I am trying to fit a GAM model to data under two constraints simultaneously: (1) the fit is monotonic (increasing), (2) the fit goes through a fixed point, say, (x0,y0).
So far, I managed to have these two constraints work separately:
For (1), based on mgcv::pcls() documentation examples, by using mgcv::mono.con() to get linear constraints sufficient for monotonicity, and estimate model coefs via mgcv::pcls(), using the constraints.
For (2), based on this post, by setting the value of spline at knot location x0 to 0 + using offset term in the model formula.
However, I struggle to combine these two constraints simultaneously. I guess a way to go is mgcv::pcls(), but I could work out neither (a) doing a similar trick of setting the value of spline at knot location x0 to 0 + using offset nor (b) setting equality constraint(s) (which I think could yield my (2) constraint setup).
I also note that the approach of setting the value of the spline at knot location x0 to 0 for my constraint (2) yields a weirdly wiggly fit (compared to the unconstrained GAM fit), as shown below.
Attempt so far: fit a smooth function to data under two constraints separately
Simulate some data
library(mgcv)
set.seed(1)
x <- sort(runif(100) * 4 - 1)
f <- exp(4*x)/(1+exp(4*x))
y <- f + rnorm(100) * 0.1
dat <- data.frame(x=x, y=y)
GAM unconstrained (for comparison)
k <- 13
fit0 <- gam(y ~ s(x, k = k, bs = "cr"), data = dat)
# predict from unconstrained GAM fit
newdata <- data.frame(x = seq(-1, 3, length.out = 1000))
newdata$y_pred_fit0 <- predict(fit0, newdata = newdata)
GAM constrained: (1) the fit is monotonic (increasing)
k <- 13
# Show regular spline fit (and save fitted object)
f.ug <- gam(y~s(x,k=k,bs="cr"))
# explicitly construct smooth term's design matrix
sm <- smoothCon(s(x,k=k,bs="cr"),dat,knots=NULL)[[1]]
# find linear constraints sufficient for monotonicity of a cubic regression spline
# it assumes "cr" is the basis and its knots are provided as input
F <- mono.con(sm$xp)
G <- list(
X=sm$X,
C=matrix(0,0,0), # [0 x 0] matrix (no equality constraints)
sp=f.ug$sp, # smoothing parameter estimates (taken from unconstrained model)
p=sm$xp, # array of feasible initial parameter estimates
y=y,
w= dat$y * 0 + 1 # weights for data
)
G$Ain <- F$A # matrix for the inequality constraints
G$bin <- F$b # vector for the inequality constraints
G$S <- sm$S # list of penalty matrices; The first parameter it penalizes is given by off[i]+1
G$off <- 0 # Offset values locating the elements of M$S in the correct location within each penalty coefficient matrix. (Zero offset implies starting in first location)
p <- pcls(G); # fit spline (using smoothing parameter estimates from unconstrained fit)
# predict
newdata$y_pred_fit2 <- Predict.matrix(sm, data.frame(x = newdata$x)) %*% p
# plot
plot(y ~ x, data = dat)
lines(y_pred_fit0 ~ x, data = newdata, col = 2, lwd = 2)
lines(y_pred_fit2 ~ x, data = newdata, col = 4, lwd = 2)
Blue line: constrained; red line: unconstrained
GAM constrained: (2) the fit goes through (x0,y0)=(-1, -0.1)
k <- 13
## Create a spline basis and penalty
## Make sure there is a knot at the constraint point (here: -1)
knots <- data.frame(x = seq(-1,3,length=k))
# explicit construction of a smooth term in a GAM
sm <- smoothCon(s(x,k=k,bs="cr"), dat, knots=knots)[[1]]
## 1st parameter is value of spline at knot location -1, set it to 0 by dropping
knot_which <- which(knots$x == -1)
X <- sm$X[, -knot_which] ## spline basis
S <- sm$S[[1]][-knot_which, -knot_which] ## spline penalty
off <- dat$y * 0 + (-0.1) ## offset term to force curve through (x0, y0)
## fit spline constrained through (x0, y0)
gam_1 <- gam(y ~ X - 1 + offset(off), paraPen = list(X = list(S)))
# predict (add offset of -0.1)
newdata_tmp <- Predict.matrix(sm, data.frame(x = newdata$x))
newdata_tmp <- newdata_tmp[, -knot_which]
newdata$y_pred_fit1 <- (newdata_tmp %*% coef(gam_1))[, 1] + (-0.1)
# plot
plot(y ~ x, data = dat)
lines(y_pred_fit0 ~ x, data = newdata, col = 2, lwd = 2)
lines(y_pred_fit1 ~ x, data = newdata, col = 3, lwd = 2)
# reference lines whose intersection the fit should pass through
abline(v=-1, col = 3); abline(h=-0.1, col = 3)
Green line: constrained; red line: unconstrained
I think you could augment the data vectors x and y with (x0, y0) and then put a (really) high weight on the first observation (i.e. add a weight vector to your G list).
As an alternative to the simple weighting strategy, we can write the problem as a quadratic programme, starting from the results of the preliminary smoothing. This is illustrated in the second R-code below (in this case I used P-spline smoothers, see Eilers and Marx 1996).
Hope this helps a bit (a similar problem is discussed here).
Rcode example 1 (weight strategy)
set.seed(123)
N = 100
x <- sort(runif(N) * 4 - 1)
f <- exp(4*x)/(1+exp(4*x))
y <- f + rnorm(N) * 0.1
x = c(-1, x)
y = c(-0.1, y)
dat = data.frame(x = x, y= y)
k <- 13
fit0 <- gam(y ~ s(x, k = k, bs = "cr"), data = dat)
# predict from unconstrained GAM fit
newdata <- data.frame(x = seq(-1, 3, length.out = 1000))
newdata$y_pred_fit0 <- predict(fit0, newdata = newdata)
k <- 13
# Show regular spline fit (and save fitted object)
f.ug <- gam(y~s(x,k=k,bs="cr"))
# explicitly construct smooth term's design matrix
sm <- smoothCon(s(x,k=k,bs="cr"),dat,knots=NULL)[[1]]
# find linear constraints sufficient for monotonicity of a cubic regression spline
# it assumes "cr" is the basis and its knots are provided as input
F <- mono.con(sm$xp)
G <- list(
X=sm$X,
C=matrix(0,0,0), # [0 x 0] matrix (no equality constraints)
sp=f.ug$sp, # smoothing parameter estimates (taken from unconstrained model)
p=sm$xp, # array of feasible initial parameter estimates
y=y,
w= c(1e8, 1:N * 0 + 1) # weights for data
)
G$Ain <- F$A # matrix for the inequality constraints
G$bin <- F$b # vector for the inequality constraints
G$S <- sm$S # list of penalty matrices; The first parameter it penalizes is given by off[i]+1
G$off <- 0 # Offset values locating the elements of M$S in the correct location within each penalty coefficient matrix. (Zero offset implies starting in first location)
p <- pcls(G); # fit spline (using smoothing parameter estimates from unconstrained fit)
# predict
newdata$y_pred_fit2 <- Predict.matrix(sm, data.frame(x = newdata$x)) %*% p
# plot
plot(y ~ x, data = dat)
lines(y_pred_fit0 ~ x, data = newdata, col = 2, lwd = 2)
lines(y_pred_fit2 ~ x, data = newdata, col = 4, lwd = 2)
abline(v = -1)
abline(h = -0.1)
Rcode example 2 (quadratic programming strategy)
rm(list = ls())
library(mgcv)
library(pracma)
set.seed(123)
N = 100
x = sort(runif(N) * 4 - 1)
f = exp(4*x)/(1+exp(4*x))
y = f + rnorm(N) * 0.1
x0 = -1
y0 = -0.1
dat = data.frame(x = x, y= y)
k = 50
# Show regular spline fit (and save fitted object)
f.ug = gam(y~s(x,k=k,bs="ps"))
# explicitly construct smooth term's design matrix
sm = smoothCon(s(x,k=k,bs="ps"), dat,knots=NULL)[[1]]
# Build quadprog to estimate the coefficients
scf = sapply(f.ug$smooth, '[[', 'S.scale')
lam = f.ug$sp / scf
Xp = rbind(sm$X, sqrt(lam) * f.ug$smooth[[1]]$D)
yp = c(dat$y, rep(0, k - 2))
X0 = Predict.matrix(sm, data.frame(x = x0))
sm$deriv = 1
X1 = Predict.matrix(sm, data.frame(x = dat$x))
coef_mono = pracma::lsqlincon(Xp, yp, Aeq = X0, beq = y0, A = -X1, b = rep(0, N))
# fitted values
fit = sm$X %*% coef_mono
sm$deriv = 0
xf = seq(-1, 3, len = 1000)
Xf = Predict.matrix(sm, data.frame(x = xf))
fine_fit = Xf %*% coef_mono
# plot
par(mfrow = c(2, 1), mar = c(3,3,3,3))
plot(dat$x, dat$y, pch = 1, main= 'Data and fit')
lines(dat$x, f.ug$fitted, lwd = 2, col = 2)
lines(dat$x, fit, col = 4, lty = 1, lwd = 2)
lines(xf, fine_fit, col = 3, lwd = 2, lty = 2)
abline(h = -0.1)
abline(v = -1)
plot(dat$x, X1 %*% coef_mono, type = 'l', main = 'Derivative of the fit', lwd = 2)
abline(h = 0.0)
The following package seems to implement what you are looking for:
The proposed shape constrained smoothing has been incorporated into generalized
additive models with a mixture of unconstrained and shape restricted smooth terms
(mono-GAM). [...]
The proposed modelling approach has been implemented in an R package monogam.
The model setup is the same as in mgcv(gam) with the addition of shape constrained
smooths. In order to be consistent with the unconstrained GAM, the package provides
key functions similar to those associated with mgcv(gam).
Additive models with shape constraints

Performing residual bootstrap using kernel regression in R

Kernel regression is a non-parametric technique for estimating the conditional expectation of a random variable. It uses local averaging of the response value, Y, to find a non-linear relationship between X and Y.
I have used the bootstrap for kernel density estimation and now want to use it for kernel regression as well. I have been told to use residual bootstrapping for kernel regression and have read a couple of papers on this. However, I am unsure how to perform it. The programming has been done in R using the FKSUM package. I have made an attempt to use standard resampling on kernel regression:
library(FKSUM)
set.seed(1)
n <- 5000
sample.size <- 500
B.replications <- 200
x <- rbeta(n, 2, 2) * 10
y <- 3 * sin(2 * x) + 10 * (x > 5) * (x - 5)
y <- y + rnorm(n) + (rgamma(n, 2, 2) - 1) * (abs(x - 5) + 3)
#taking x.y to be the population
x.y <- data.frame(x, y)
xs <- seq(min(x), max(x), length = 1000)
ftrue <- 3 * sin(2 * xs) + 10 * (xs > 5) * (xs - 5)
#Sample from the population
seqx<-seq(1,5000,by=1)
sample.ind <- sample(seqx, size = sample.size, replace = FALSE)
sample.reg<-x.y[sample.ind,]
x_s <- sample.reg$x
y_s <- sample.reg$y
fhat_loc_lin.pop <- fk_regression(x, y)
fhat_loc_lin.sample <- fk_regression(x = x_s, y = y_s)
plot(x, y, col = rgb(.7, .7, .7, .3), pch = 16, xlab = 'x',
ylab = 'y', main = 'Local linear estimator with amise bandwidth')
lines(xs, ftrue, col = 2, lwd = 3)
lines(fhat_loc_lin.pop, lty = 2, lwd = 2)
#Bootstrap
n.B.sample = sample.size # sample bootstrap size
boot.reg.mat.X <- matrix(0,ncol=B.replications, nrow=n.B.sample)
boot.reg.mat.Y <- matrix(0,ncol=B.replications, nrow=n.B.sample)
fhat_loc_lin.boot <- matrix(0,ncol = B.replications, nrow=100)
Temp.reg.y <- matrix(0,ncol = B.replications,nrow = 1000)
for(i in 1:B.replications){
sequence.x.boot <- seq(from=1,to=n.B.sample,by=1)
sample.ind.boot <- sample(sequence.x.boot, size = sample.size, replace = TRUE)
boot.reg.mat <- sample.reg[sample.ind.boot,]
boot.reg.mat.X <- boot.reg.mat$x
boot.reg.mat.Y <- boot.reg.mat$y
fhat_loc_lin.boot <- fk_regression(x = boot.reg.mat.X ,
y = boot.reg.mat.Y,
h = fhat_loc_lin.sample$h)
lines(y=fhat_loc_lin.boot$y,x= fhat_loc_lin.sample$x, col =c(i) )
Temp.reg.y[,i] <- fhat_loc_lin.boot$y
}
quan.reg.l <- vector()
quan.reg.u <- vector()
for(i in 1:length(xs)){
quan.reg.l[i] <- quantile(x = Temp.reg.y[i,],probs = 0.025)
quan.reg.u[i] <- quantile(x = Temp.reg.y[i,],probs = 0.975)
}
# Lower Bound
Temp.reg.2 <- quan.reg.l
lines(y=Temp.reg.2,x=fhat_loc_lin.boot$x ,col="red",lwd=4,lty=1)
# Upper Bound
Temp.reg.3 <- quan.reg.u
lines(y=Temp.reg.3,x=fhat_loc_lin.boot$x ,col="navy",lwd=4,lty=1)
Asking the question on here now since I haven't received any response on CV. Any help would be greatly appreciated!
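For reference, the code above resamples (x, y) pairs, whereas a residual bootstrap keeps the x values fixed and resamples the residuals of the fitted curve. The following is a minimal sketch of that idea in base R, not a definitive implementation. It makes several assumptions that are not in the question: a Nadaraya-Watson estimator with a Gaussian kernel (the hypothetical helper nw_fit) stands in for fk_regression, the bandwidth is fixed by hand, the simulated data are different from the data above, and the errors are treated as homoscedastic.

```r
# Nadaraya-Watson estimator with a Gaussian kernel
# (nw_fit is a hypothetical helper used so the sketch only needs base R)
nw_fit <- function(x, y, h, grid) {
  sapply(grid, function(g) {
    w <- dnorm((g - x) / h)   # kernel weights around the grid point g
    sum(w * y) / sum(w)       # locally weighted average of the response
  })
}

set.seed(1)
n <- 200
x <- sort(runif(n, 0, 10))
y <- sin(x) + rnorm(n, sd = 0.3)   # illustrative data with constant noise level
h <- 0.5                           # bandwidth chosen by hand (assumption)
grid <- seq(0.5, 9.5, length.out = 50)

# Step 1: fit at the observed x and compute centred residuals
fitted_obs <- nw_fit(x, y, h, x)
res <- y - fitted_obs
res <- res - mean(res)

# Step 2: rebuild responses from fitted values plus resampled residuals, refit
B <- 200
boot <- replicate(B, {
  y_star <- fitted_obs + sample(res, n, replace = TRUE)  # x stays fixed
  nw_fit(x, y_star, h, grid)
})

# Step 3: pointwise percentile band across the bootstrap fits
band <- apply(boot, 1, quantile, probs = c(0.025, 0.975))
fhat <- nw_fit(x, y, h, grid)

plot(x, y, col = "grey")
lines(grid, fhat, lwd = 2)
lines(grid, band[1, ], lty = 2, col = "red")
lines(grid, band[2, ], lty = 2, col = "navy")
```

Because every bootstrap fit is evaluated on the same grid, the quantiles line up pointwise. When the noise variance changes with x, as in the simulated data in the question, resampling residuals across the whole range is questionable, and a wild bootstrap is a commonly used alternative.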

After fitting the cumulative distribution in R creating the normal distribution from fitted parameters

After successfully fitting my cumulative data with a Gompertz function, I need to create a normal distribution from the fitted function.
This is the code so far:
df <- data.frame(x = c(0.01,0.011482,0.013183,0.015136,0.017378,0.019953,0.022909,0.026303,0.0302,0.034674,0.039811,0.045709,0.052481,0.060256,0.069183,0.079433,0.091201,0.104713,0.120226,0.138038,0.158489,0.18197,0.20893,0.239883,0.275423,0.316228,0.363078,0.416869,0.47863,0.549541,0.630957,0.724436,0.831764,0.954993,1.096478,1.258925,1.44544,1.659587,1.905461,2.187762,2.511886,2.884031,3.311311,3.801894,4.365158,5.011872,5.754399,6.606934,7.585776,8.709636,10,11.481536,13.182567,15.135612,17.378008,19.952623,22.908677,26.30268,30.199517,34.673685,39.810717,45.708819,52.480746,60.255959,69.183097,79.432823,91.201084,104.712855,120.226443,138.038426,158.489319,181.970086,208.929613,239.883292,275.42287,316.227766,363.078055,416.869383,478.630092,549.540874,630.957344,724.43596,831.763771,954.992586,1096.478196),
y = c(0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.00044816,0.00127554,0.00221488,0.00324858,0.00438312,0.00559138,0.00686054,0.00817179,0.00950625,0.01085188,0.0122145,0.01362578,0.01514366,0.01684314,0.01880564,0.02109756,0.0237676,0.02683182,0.03030649,0.0342276,0.03874555,0.04418374,0.05119304,0.06076553,0.07437854,0.09380666,0.12115065,0.15836926,0.20712933,0.26822017,0.34131335,0.42465413,0.51503564,0.60810697,0.69886817,0.78237651,0.85461023,0.91287236,0.95616228,0.98569093,0.99869001,0.99999999,0.99999999,0.99999999,0.99999999,0.99999999,0.99999999,0.99999999,0.99999999,0.99999999,0.99999999,0.99999999,0.99999999,0.99999999))
library(drc)
fm <- drm(y ~ x, data = df, fct = G.3())
options(scipen = 10) #to avoid scientific notation in x axis
plot(df$x, predict(fm),type = "l", log = "x",col = "blue",
main = "Cumulative function distribution",xlab = "x", ylab = "y")
points(df,col = "red")
legend("topleft", inset = .05,legend = c("exp","fit")
,lty = c(NA,1), col = c("red", "blue"), pch = c(1,NA), lwd=1, bty = "n")
summary(fm)
This produces the following plot:
My idea now is to somehow transform this cumulative fit into a normal distribution. Any ideas on how I could do that?
While your original intention might be non-parametric, I suggest a parametric estimation method, the method of moments, which is widely used for problems like this because you have a specific parametric distribution (the normal distribution) to fit. The idea is quite simple: from the fitted cumulative distribution function you can calculate the mean (E1 in my code) and the variance (the square of SD in my code), and then the problem is solved, because a normal distribution is completely determined by its mean and variance.
df <- data.frame(x=c(0.01,0.011482,0.013183,0.015136,0.017378,0.019953,0.022909,0.026303,0.0302,0.034674,0.039811,0.045709,0.052481,0.060256,0.069183,0.079433,0.091201,0.104713,0.120226,0.138038,0.158489,0.18197,0.20893,0.239883,0.275423,0.316228,0.363078,0.416869,0.47863,0.549541,0.630957,0.724436,0.831764,0.954993,1.096478,1.258925,1.44544,1.659587,1.905461,2.187762,2.511886,2.884031,3.311311,3.801894,4.365158,5.011872,5.754399,6.606934,7.585776,8.709636,10,11.481536,13.182567,15.135612,17.378008,19.952623,22.908677,26.30268,30.199517,34.673685,39.810717,45.708819,52.480746,60.255959,69.183097,79.432823,91.201084,104.712855,120.226443,138.038426,158.489319,181.970086,208.929613,239.883292,275.42287,316.227766,363.078055,416.869383,478.630092,549.540874,630.957344,724.43596,831.763771,954.992586,1096.478196),
y=c(0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.00044816,0.00127554,0.00221488,0.00324858,0.00438312,0.00559138,0.00686054,0.00817179,0.00950625,0.01085188,0.0122145,0.01362578,0.01514366,0.01684314,0.01880564,0.02109756,0.0237676,0.02683182,0.03030649,0.0342276,0.03874555,0.04418374,0.05119304,0.06076553,0.07437854,0.09380666,0.12115065,0.15836926,0.20712933,0.26822017,0.34131335,0.42465413,0.51503564,0.60810697,0.69886817,0.78237651,0.85461023,0.91287236,0.95616228,0.98569093,0.99869001,0.99999999,0.99999999,0.99999999,0.99999999,0.99999999,0.99999999,0.99999999,0.99999999,0.99999999,0.99999999,0.99999999,0.99999999,0.99999999))
library(drc)
fm <- drm(y ~ x, data = df, fct = G.3())
options(scipen = 10) #to avoid scientific notation in x axis
plot(df$x, predict(fm),type="l", log = "x",col="blue", main="Cumulative distribution function",xlab="x", ylab="y")
points(df,col="red")
E1 <- sum((df$x[-1] + df$x[-length(df$x)]) / 2 * diff(predict(fm)))
E2 <- sum((df$x[-1] + df$x[-length(df$x)]) ^ 2 / 4 * diff(predict(fm)))
SD <- sqrt(E2 - E1 ^ 2)
points(df$x, pnorm((df$x - E1) / SD), col = "green")
legend("topleft", inset = .05,legend= c("exp","fit","method of moment")
,lty = c(NA,1), col = c("red", "blue", "green"), pch = c(1,NA), lwd=1, bty="n")
summary(fm)
And the estimation results:
## > E1 (mean of fitted normal distribution)
## [1] 65.78474
## > E2 (second moment of fitted normal distribution)
##[1] 5792.767
## > SD (standard deviation of fitted normal distribution)
## [1] 38.27707
## > SD ^ 2 (variance of fitted normal distribution)
## [1] 1465.134
Edit: updated the method for calculating moments from the cdf fitted by drc. The function moment defined below estimates the k-th moment using the formula for a nonnegative continuous r.v., E(X ^ k) = k * \int_0^\infty x ^ {k - 1} (1 - cdf(x)) dx. These are the best estimates I can get from the fitted cdf. The fit is not very good when x is near zero, because of the issue with the original dataset that I discussed in the comments.
df <- data.frame(x=c(0.01,0.011482,0.013183,0.015136,0.017378,0.019953,0.022909,0.026303,0.0302,0.034674,0.039811,0.045709,0.052481,0.060256,0.069183,0.079433,0.091201,0.104713,0.120226,0.138038,0.158489,0.18197,0.20893,0.239883,0.275423,0.316228,0.363078,0.416869,0.47863,0.549541,0.630957,0.724436,0.831764,0.954993,1.096478,1.258925,1.44544,1.659587,1.905461,2.187762,2.511886,2.884031,3.311311,3.801894,4.365158,5.011872,5.754399,6.606934,7.585776,8.709636,10,11.481536,13.182567,15.135612,17.378008,19.952623,22.908677,26.30268,30.199517,34.673685,39.810717,45.708819,52.480746,60.255959,69.183097,79.432823,91.201084,104.712855,120.226443,138.038426,158.489319,181.970086,208.929613,239.883292,275.42287,316.227766,363.078055,416.869383,478.630092,549.540874,630.957344,724.43596,831.763771,954.992586,1096.478196),
y=c(0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.00044816,0.00127554,0.00221488,0.00324858,0.00438312,0.00559138,0.00686054,0.00817179,0.00950625,0.01085188,0.0122145,0.01362578,0.01514366,0.01684314,0.01880564,0.02109756,0.0237676,0.02683182,0.03030649,0.0342276,0.03874555,0.04418374,0.05119304,0.06076553,0.07437854,0.09380666,0.12115065,0.15836926,0.20712933,0.26822017,0.34131335,0.42465413,0.51503564,0.60810697,0.69886817,0.78237651,0.85461023,0.91287236,0.95616228,0.98569093,0.99869001,0.99999999,0.99999999,0.99999999,0.99999999,0.99999999,0.99999999,0.99999999,0.99999999,0.99999999,0.99999999,0.99999999,0.99999999,0.99999999))
library(drc)
fm <- drm(y ~ x, data = df, fct = G.3())
moment <- function(k){
f <- function(x){
x ^ (k - 1) * pmax(0, 1 - predict(fm, data.frame(x = x)))
}
k * integrate(f, lower = min(df$x), upper = max(df$x))$value
}
E1 <- moment(1)
E2 <- moment(2)
SD <- sqrt(E2 - E1 ^ 2)
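As a sanity check of the moment formula E(X ^ k) = k * \int x ^ {k - 1} (1 - cdf(x)) dx, it can be applied to a distribution whose moments are known in closed form. The sketch below uses an exponential distribution with rate 2 as a stand-in for the fitted cdf; the helper name moment_from_cdf and the choice of distribution are assumptions made for illustration only.

```r
# k-th raw moment of a nonnegative r.v. computed from its CDF
# (moment_from_cdf is a hypothetical helper, mirroring moment() above)
moment_from_cdf <- function(k, cdf, upper = Inf) {
  f <- function(x) x^(k - 1) * (1 - cdf(x))
  k * integrate(f, lower = 0, upper = upper)$value
}

# Exponential distribution with rate 2: E(X) = 1/2, E(X^2) = 2/2^2 = 1/2
cdf_exp <- function(x) pexp(x, rate = 2)

m1 <- moment_from_cdf(1, cdf_exp)
m2 <- moment_from_cdf(2, cdf_exp)
sd_est <- sqrt(m2 - m1^2)   # exact standard deviation is 1/rate = 1/2

c(mean = m1, second_moment = m2, sd = sd_est)
```

The integration here starts at 0 and runs to infinity. In moment() above the range is limited to the observed x values, so any probability mass outside that range is ignored.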
I was thinking of the cumdiff (for lack of a better term). The link helped a lot.
EDIT
plot(df$x[-1], Mod(df$y[-length(df$y)]-df$y[-1]), log = "x", type = "b",
main = "Normal distribution for original data",
xlab = "x", ylab = "y")
yielding:
ADDITION
In order to get the Gaussian from the fitted function:
df$y_pred<-predict(fm)
plot(df$x[-1], Mod(df$y_pred[-length(df$y_pred)]-df$y_pred[-1]), log = "x",
type = "b", main="Normal distribution for fitted function",
xlab = "x", ylab = "y")
yielding:

Plotting the CDF and Quantile Functions Given the PDF

How would I plot the CDF and Quantile functions, in R, if I have the PDF. Currently, I have the following (but I think there must be a better way to do it):
## Probability Density Function
p <- function(x) {
result <- (x^2)/9
result[x < 0 | x > 3] <- 0
result
}
plot(p, xlim = c(0,3), main="Probability Density Function")
## Cumulative Distribution Function
F <- function(a = 0,b){
result <- ((b^3)/27) - ((a^3)/27)
result[a < 0 ] <- 0
result[b > 3] <- 1
result
}
curve(F(b = x), from = 0, to = 3, main="Cumulative Distribution Function")
## Quantile Function
Finv <- function(p) {
  3*p^(1/3)
}
As @dash2 suggested, getting the CDF requires integrating the PDF, that is, finding the area under the curve.
Here's a generic solution which should help. I am using a Gaussian distribution as an example, but you should be able to feed it any density function.
Note that the quantiles reported are approximations only. Also, don't forget to look into the documentation for integrate().
# CDF Function
CDF <- function(FUNC = p, plot = TRUE, area = 0.5, LOWER = -10, UPPER = 10, SIZE = 1000){
  # Create data (use the function passed in as FUNC, not the global p)
  x <- seq(LOWER, UPPER, length.out = SIZE)
  y <- FUNC(x)
  # Cumulative area under FUNC from LOWER up to each grid point
  area.vec <- numeric(length(x))
  for(i in 2:length(x)){
    area.vec[i] <- integrate(FUNC, lower = x[1], upper = x[i])$value
  }
  # Quantile: grid point whose cumulative area is closest to the requested area
  q <- x[which.min(abs(area.vec - area))]
  # Plot if requested
  if(plot){
    # PDF
    par(mfrow = c(1, 2))
    plot(x, y, type = "l", main = "PDF", col = "indianred", lwd = 2)
    grid()
    # CDF
    plot(x, area.vec, type = "l", main = "CDF", col = "slateblue",
         xlab = "X", ylab = "CDF", lwd = 2)
    # Quantile
    mtext(text = paste("Quantile at ", area, "=",
                       round(q, 3)), side = 3)
    grid()
    par(mfrow = c(1, 1))
  }
  # Return the approximate quantile without printing it
  invisible(q)
}
# Sample data
# PDF Function - Gaussian distribution
p <- function(x, SD = 1, MU = 0){
y <- (1/(SD * sqrt(2*pi)) * exp(-0.5 * ((x - MU)/SD) ^ 2))
return(y)
}
# Call to function
CDF(p, area = 0.5, LOWER = -5, UPPER = 5)
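For the specific density in the question, f(x) = x^2/9 on [0, 3], the CDF and quantile function can also be computed numerically and compared with the closed forms x^3/27 and 3 * p^(1/3). The sketch below is one way to do this with integrate() and uniroot(); the function names pdf_q, cdf_q and quant_q are illustrative and not part of the original code.

```r
# PDF from the question: f(x) = x^2 / 9 on [0, 3], zero elsewhere
pdf_q <- function(x) ifelse(x < 0 | x > 3, 0, x^2 / 9)

# CDF by numerical integration, vectorised over q
cdf_q <- function(q) sapply(q, function(qi) {
  if (qi <= 0) return(0)
  integrate(pdf_q, lower = 0, upper = min(qi, 3))$value
})

# Quantile function by root finding on the CDF, for probabilities in (0, 1)
quant_q <- function(p) sapply(p, function(prob) {
  uniroot(function(q) cdf_q(q) - prob, lower = 0, upper = 3, tol = 1e-9)$root
})

curve(cdf_q(x), from = 0, to = 3, main = "Cumulative Distribution Function")
curve(quant_q(x), from = 0.01, to = 0.99, main = "Quantile Function")

cdf_q(1.5)     # compare with the closed form 1.5^3 / 27
quant_q(0.5)   # compare with the closed form 3 * 0.5^(1/3)
```

Root finding gives a quantile that is not tied to the resolution of a grid, at the cost of one uniroot() call per probability.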

fitting function for a given data set

I'm trying to fit the function y(x) = a*(1 + (x^2)/(b^2))^t to a particular set of data, where a, b and t are constants that I want to determine by fitting.
I try the following, for example
len <- 24
x = runif(len)
y = x^3 + runif(len, min = -0.1, max = 0.1)
plot(x, y)
s <- seq(from = 0, to = 1, length = 50)
lines(s, s^3, lty = 2)
df <- data.frame(x, y)
m <- nls(y~a*( 1 + (x^2)/(b^2) )^t, data = df, start = list(a=1,t=0, b=1), trace = T)
> Error in nlsModel(formula, mf, start, wts) :
singular gradient matrix at initial parameter estimates
Can someone help me fit this function to these points? Even if the fit turns out to be poor, the important thing is that the fit runs on the data.
Thanks, everyone.
Because your data are changing randomly, for some situations the value of a is close to zero and your function becomes zero. The curve fit procedure fails at that point. Randomizing the start parameters might work for some situations.
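One concrete way the gradient matrix becomes singular at the starting values used in the question can be checked directly. With t = 0 the model reduces to y = a, so the derivative with respect to b, a * t * (1 + x^2/b^2)^(t - 1) * (-2 * x^2 / b^3), is zero for every x. The sketch below builds the analytic Jacobian (the helper name jac is illustrative) and compares its rank at the question's start values and at a start with t = 1.

```r
# Analytic Jacobian of y = a * (1 + x^2 / b^2)^t with respect to (a, b, t)
# (jac is a hypothetical helper written for this check)
jac <- function(a, b, t, x) {
  base <- 1 + x^2 / b^2
  cbind(a = base^t,
        b = a * t * base^(t - 1) * (-2 * x^2 / b^3),
        t = a * base^t * log(base))
}

set.seed(1)
x <- runif(24)

J0 <- jac(a = 1, b = 1, t = 0, x = x)   # start values from the question
J1 <- jac(a = 1, b = 1, t = 1, x = x)   # same start, but with t = 1

range(J0[, "b"])   # the whole b column is zero when t = 0
qr(J0)$rank        # fewer than 3 independent columns
qr(J1)$rank        # full column rank
```

Starting from a nonzero t therefore avoids this particular failure at the first iteration, although it does not guarantee that the fit converges for every random data set.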
A slightly more stable output can be computed using the LM algorithm:
require("minpack.lm")
LMCurveFit <- function(df) {
  # The function to be fit
  FitFunction <- function(params, x) {
    with (
      as.list(params), {
        a*(1 + x^2/b^2)^t
      }
    )
  }
  # Residual
  Residual <- function(params, x, y) {
    FitFunction(params, x) - y
  }
  # Sum of squares of residuals
  ssqfun <- function(params, x, y) {
    sum(Residual(params, x, y)^2)
  }
  # Normalize the data (use the columns of df, not the global x and y)
  x_max = max(df$x)
  y_max = max(df$y)
  df$x = df$x/x_max
  df$y = df$y/y_max
  # Define start parameters
  a_start = 0.1
  b_start = 1.0
  t_start = 1.0
  param_start = c(a = a_start,
                  b = b_start,
                  t = t_start)
  # Do LM fit
  nls.out <- nls.lm(par = param_start,
                    fn = Residual,
                    control = nls.lm.control(nprint=0,
                                             ftol=.Machine$double.eps,
                                             ptol=.Machine$double.eps,
                                             maxfev=10000, maxiter=1024),
                    x = df$x,
                    y = df$y)
  # Revert scaling
  nls.out$par[1] = nls.out$par[1]*y_max
  nls.out$par[2] = nls.out$par[2]*x_max
  # Get the parameters
  params_fit = coef(nls.out)
  print(params_fit)
  # Compute and return predicted values on the original x scale
  predicted = FitFunction(as.list(params_fit), df$x*x_max)
  predicted
}
# LM fit
pred_y = LMCurveFit(df)
# Sort by x so the fitted curve is drawn from left to right
ord = order(df$x)
lines(df$x[ord], pred_y[ord])
