R: Turn a [KDE] density plot into a cdf?

Data:
34,46,47,48,52,53,55,56,56,56,57,58,59,59,68
(Figures: density plot of the data; ECDF of the data)
What I'd like to do is take the derived density plot and turn it into a cumulative distribution function to read percentages from. And vice versa.
My hope is to use the kernel density estimation specifically to derive a smoothed cumulative distribution function. I don't wish to rely on the raw data points to do a ECDF, but use the KDE to do a CDF.
Edit:
I see there is a KernelSmoothing.CDF, might this be the solution? If it is, I have no idea how to implement it so far.
Mathworks has an example of what I'm trying to do, converting from an ECDF to a KECDF under "Compute and plot the estimated cdf evaluated at a specified set of values."
http://www.mathworks.com/help/stats/examples/nonparametric-estimates-of-cumulative-distribution-functions-and-their-inverses.html?requestedDomain=www.mathworks.com
although I think the implementation is fairly sloppy. I'm wondering whether a polynomial regression line would be a better fit.

library("DiagTest3Grp", lib.loc = "~/R/win-library/3.2")
data <- c(34,46,47,48,52,53,55,56,56,56,57,58,59,59,68)
bw <- BW.ref(data)
x0 <- seq(0, 100, .1)
KS.cdfvec <- Vectorize(KernelSmoothing.cdf, vectorize.args = "c0")
x0.cdf <- KS.cdfvec(xx = data, c0 = x0, bw = bw)
plot(x0, x0.cdf, type = "l")
I still need to figure out how to derive y given x, but this was a major help
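For reference, a smoothed CDF and its inverse can also be sketched in base R without DiagTest3Grp. This is a minimal sketch, assuming a Gaussian kernel and the default bw.nrd0 bandwidth (both are my assumptions, not something BW.ref is known to match). With a Gaussian kernel the KDE's CDF is just the average of normal CDFs centred on the data points. The names kde_cdf and kde_quantile are made up for illustration.

```r
data <- c(34,46,47,48,52,53,55,56,56,56,57,58,59,59,68)

# Assumed bandwidth: the rule of thumb that density() uses by default
bw <- bw.nrd0(data)

# CDF of a Gaussian KDE: average of normal CDFs centred at each observation
kde_cdf <- function(q) {
  sapply(q, function(v) mean(pnorm(v, mean = data, sd = bw)))
}

# Inverse CDF (x for a given probability), found numerically with uniroot
kde_quantile <- function(p) {
  search <- range(data) + c(-10, 10) * bw
  sapply(p, function(pp) {
    uniroot(function(q) kde_cdf(q) - pp, interval = search, tol = 1e-10)$root
  })
}

x0 <- seq(0, 100, 0.1)
plot(x0, kde_cdf(x0), type = "l", xlab = "x", ylab = "smoothed CDF")

p_at_56 <- kde_cdf(56)         # proportion at or below 56
x_median <- kde_quantile(0.5)  # x value at the 50% mark
```

kde_cdf gives y for a given x, and kde_quantile goes the other way, so both directions come from the same kernel estimate rather than from the raw ECDF.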

Related

Empirical CDF vs Theoretical CDF in R

I want to check the "probability integral transform" theorem using R.
Let's suppose X is an exponential random variable with lambda = 5.
I want to check that the random variable U = F_X(X) = 1 - exp(-5*X) has a uniform (0,1) distribution.
How would you do it?
I would start in this way:
nsample <- 1000
lambda <- 5
x <- rexp(nsample, lambda) #1000 exponential observation
u <- 1- exp(-lambda*x) #CDF of x
Then I need to find the CDF of u and compare it with the CDF of a Uniform (0,1).
For the empirical CDF of u I could use the ECDF function:
ECDF_u <- ecdf(u) #empirical CDF of U
Now I should create the theoretical CDF of Uniform (0,1) and plot it on the same graph of the ECDF in order to compare the two graphs.
Can you help with the code?
You are almost there. You don't need to compute the ECDF yourself – qqplot will take care of this. All you need is your sample (u) and data from the distribution you want to check against. The lazy (and not quite correct) approach would be to check against a random sample drawn from a uniform distribution:
qqplot(runif(nsample), u)
But of course, it is better to plot against the theoretical quantiles:
# the actual plot
qqplot( qunif(ppoints(length(u))), u )
# add a line
qqline(u, distribution=qunif, col='red', lwd=2)
Looks pretty good to me.
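If you do want the picture the question asks for (the ECDF of u with the theoretical Uniform(0,1) CDF on the same graph), a minimal sketch could look like this. The seed and the Kolmogorov-Smirnov check at the end are my additions, not part of the question.

```r
set.seed(1)  # assumed seed, only for reproducibility
nsample <- 1000
lambda <- 5
x <- rexp(nsample, lambda)
u <- 1 - exp(-lambda * x)   # same as pexp(x, lambda)

# Empirical CDF of u, with the Uniform(0,1) CDF drawn on top
plot(ecdf(u), main = "ECDF of u vs. Uniform(0,1) CDF")
curve(punif(x), from = 0, to = 1, add = TRUE, col = "red", lwd = 2)

# Optional numeric check: largest vertical gap between the two curves
ks <- ks.test(u, "punif")
ks$statistic
```

The two curves should sit almost on top of each other, and the KS statistic should be small for a sample of this size.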

R - poly.calc not stable when using many points

I've been trying to solve a problem using Lagrange interpolation, which is implemented in poly.calc method (polynom package) in R language.
Basically, my problem is to predict the population of a certain country using Lagrange Interpolation. I have the population from the past years (1961 - 2014). The csv file is here
library(polynom)
w1 = read.csv(file="country.csv", sep=",", header=TRUE)
array_x = w1$x
array_y = w1$y
#calls Lagrange Method
p = poly.calc(array_x, array_y)
#create a function to evaluate the polynom
prf <- as.function(p)
#create some points to plot
myx = seq(1961, 2020, 0.5)
#y's to plot
myy = prf(myx)
#plot
plot(myx, myy,col='blue')
After that, the plotted curve is declining and the y values are huge and negative (with an exponent of 134).
It does not make sense.
However, if I use like five points, it is correct.
This is not really an SO question but rather a numerical analysis question.
R is doing exactly what you asked; this is not a programming error. The problem is that what you are asking for is numerically ill-behaved: Lagrange interpolating polynomials are notoriously unstable, especially when they are fit through a large number of points.
A much more stable alternative is to use splines, such as B-splines. They can be added to any regression model with the splines package that ships with R; for example, you could fit a least-squares model with:
library(splines)
x <- sort(runif(500, -3,3) ) #sorting makes for easier plotting ahead
y <- sin(x)
splineFit <- lm(y ~ bs(x, df = 5) )
est_y <- predict(splineFit)
plot(x, y, type = 'l')
lines(x, est_y, col = 'blue')
You can see from the above model that the splines can do a good job of fitting non-linear relations.
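To see the instability itself, here is a small base R sketch. It does not use the population data or poly.calc; it uses the classic Runge test function 1 / (1 + 25 x^2) on equally spaced points, which is my choice of illustration, and a hand-written Lagrange evaluator (lagrange_eval is a made-up helper name).

```r
runge <- function(x) 1 / (1 + 25 * x^2)

# Evaluate the Lagrange interpolating polynomial through (xk, yk) at points x
lagrange_eval <- function(xk, yk, x) {
  sapply(x, function(x0) {
    terms <- sapply(seq_along(xk), function(j) {
      yk[j] * prod((x0 - xk[-j]) / (xk[j] - xk[-j]))
    })
    sum(terms)
  })
}

grid <- seq(-1, 1, length.out = 401)

# Worst-case error of the interpolating polynomial through n equally spaced points
max_err <- function(n) {
  xk <- seq(-1, 1, length.out = n)
  max(abs(lagrange_eval(xk, runge(xk), grid) - runge(grid)))
}

err_poly_5  <- max_err(5)
err_poly_21 <- max_err(21)

# Interpolating cubic spline through the same 21 points
xk <- seq(-1, 1, length.out = 21)
err_spline_21 <- max(abs(splinefun(xk, runge(xk))(grid) - runge(grid)))

c(poly_5 = err_poly_5, poly_21 = err_poly_21, spline_21 = err_spline_21)
```

Adding points makes the polynomial worse near the ends of the interval, while the spline through the same points stays close to the function. That matches the observation that five points look fine and many points do not.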

How to plot a log curve in R?

I have the following set of data:
x = c(8,16,64,128,256)
y = c(7030.8, 3624.0, 1045.8, 646.2, 369.0)
Which, when plotted, looks like an exponential decay or negative ln function.
I'm trying to fit a smooth curve to this data, but I don't know how. I've tried nls and lm functions, but I can't seem to get it right. The online examples have too many steps for the simple data I have, and I can't understand well enough to modify the examples for what I need. Any help or advice would be appreciated. Thank you.
Edit: When I say I tried nls and lm functions, I mean that the lines produced were linear, no matter what parameters I tried.
And when I say too many steps, I mean the examples I found were for predicting with 2 independent variables, or for creating multiple fit lines.
What I'm asking is what is the best way to fit a simple smooth line to data that, when graphed, looks like an exponential decay or negative ln. What the equation of the line is isn't important, it's meant to be a reference for the shape of the data.
A good way to fit a curve to data is the built-in nls function, which performs non-linear least squares optimization. For example, if you wanted to fit the model y = b * x^e, you could do:
n <- nls(y ~ b * x ^ e, data = data.frame(x, y), start = c(b = 1000, e = -1))
(?nls, or this walkthrough, can tell you more about these options). You could then plot the curve on top of your points:
plot(x, y)
curve(predict(n, newdata = data.frame(x = x)), add = TRUE)
You can try a few other models (specified by that formula in nls) that may fit your data.
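If nls is more than you need, the same power-law shape can be sketched with lm on the log-log scale. This is a swapped-in technique, not the nls fit above: it minimises error in log(y) rather than in y, so the coefficients will differ somewhat. The resulting values can also serve as starting values for nls.

```r
x <- c(8, 16, 64, 128, 256)
y <- c(7030.8, 3624.0, 1045.8, 646.2, 369.0)

# log(y) = log(b) + e * log(x) is a straight line, so lm can fit it
fit <- lm(log(y) ~ log(x))
b <- exp(coef(fit)[[1]])
e <- coef(fit)[[2]]

# Draw the back-transformed curve y = b * x^e over the points
plot(x, y)
curve(b * x^e, from = min(x), to = max(x), add = TRUE)
```

The line drawn here is curved on the original scale even though lm only ever fits a straight line, because the straight line lives on the log-log scale.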
Maybe 'lowess' is what you're looking for? Try:
plot(y ~ x)
lines(lowess(y ~ x))
That function just connects the dots. It sounds like you would prefer something that smooths out the elbows. In principle, 'loess' is useful for that, but you don't have enough data points here for that to work.

How can I get the value of a kernel density estimate at specific points?

I am experimenting with ways to deal with overplotting in R, and one thing I want to try is to plot individual points but color them by the density of their neighborhood. In order to do this I would need to compute a 2D kernel density estimate at each point. However, it seems that the standard kernel density estimation functions are all grid-based. Is there a function for computing 2D kernel density estimates at specific points that I specify? I would imagine a function that takes x and y vectors as arguments and returns a vector of density estimates.
If I understand what you want to do, it could be achieved by fitting a smoothing model to the grid density estimate and then using that to predict the density at each point you are interested in. For example:
# Simulate some data and put in data frame DF
n <- 100
x <- rnorm(n)
y <- 3 + 2* x * rexp(n) + rnorm(n)
# add some outliers
y[sample(1:n,20)] <- rnorm(20,20,20)
DF <- data.frame(x,y)
# Calculate 2d density over a grid
library(MASS)
dens <- kde2d(x,y)
# create a new data frame of that 2d density grid
# (expand.grid varies x fastest, which matches the column-major order of as.vector(dens$z))
gr <- data.frame(with(dens, expand.grid(x,y)), as.vector(dens$z))
names(gr) <- c("xgr", "ygr", "zgr")
# Fit a model
mod <- loess(zgr~xgr*ygr, data=gr)
# Apply the model to the original data to estimate density at that point
DF$pointdens <- predict(mod, newdata=data.frame(xgr=x, ygr=y))
# Draw plot
library(ggplot2)
ggplot(DF, aes(x=x,y=y, color=pointdens)) + geom_point()
Or, if I just change n to 10^6, we get
I eventually found the precise function I was looking for: interp.surface from the fields package. From the help text:
Uses bilinear weights to interpolate values on a rectangular grid to arbitrary locations or to another grid.
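For modest n there is also a grid-free route, sketched below in base R: evaluate a product Gaussian kernel estimate directly at each data point. This is not what interp.surface does (that interpolates a grid); the bw.nrd0 bandwidths and the simulated data are my assumptions, and the outer() calls need O(n^2) memory, so this will not scale to 10^6 points.

```r
set.seed(42)  # assumed seed and simulated data, for illustration only
n <- 200
x <- rnorm(n)
y <- 3 + 2 * x + rnorm(n)

# Assumed bandwidths: one rule-of-thumb bandwidth per axis
hx <- bw.nrd0(x)
hy <- bw.nrd0(y)

# Kernel weights between every pair of points, one matrix per axis
kx <- dnorm(outer(x, x, "-") / hx) / hx
ky <- dnorm(outer(y, y, "-") / hy) / hy

# Density estimate at point i = average over j of the product kernel
pointdens <- rowMeans(kx * ky)

# Colour each point by its estimated density (10 colour bins)
bins <- as.integer(cut(pointdens, 10))
plot(x, y, col = rev(heat.colors(10))[bins], pch = 16)
```

The result is a vector with one density value per (x, y) pair, which is the interface the question describes.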

finding functions that match dot plots

ggplot(test,aes(x=timepoints,y= mean,ymax = mean + sde, ymin = mean - sde)) +
geom_errorbar(width=2) +
geom_point() +
geom_line() +
stat_smooth(method='loess') +
xlab('Time (min)') +
ylab('Fold Induction') +
ggtitle('yo')
I can plot the blue 'loess'-ed line. But is there a way to find the mathematical function of the blue 'loess'-ed line?
You can get the predictions for a regular sequence:
fit <- loess(mean ~ timepoints, data = test)
# regular sequence of x values spanning the observed range
new.x <- seq(min(test$timepoints), max(test$timepoints), length.out = 100)
# newdata must use the same variable name as the model formula
fit.points <- predict(fit, newdata = data.frame(timepoints = new.x), se = FALSE)
fitdf <- data.frame(x = new.x, y = fit.points)
You can then fit to that set of points with splines of an appropriate degree. Cubic spline fits can be described with greater ease than loess fits can. It would be easier to match an answer to your variable names if you had offered a data example to work with. The plot does not seem to have been created with that code.
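As a rough illustration of that last step, here is a sketch using the built-in cars data set in place of the unavailable test data frame (so dist and speed stand in for mean and timepoints). It passes an interpolating natural cubic spline through the loess predictions, which gives a callable function for the smoothed line.

```r
# loess fit on a built-in data set (stand-in for the question's data)
fit <- loess(dist ~ speed, data = cars)

# Predictions on a regular grid across the observed x range
grid <- seq(min(cars$speed), max(cars$speed), length.out = 100)
pred <- predict(fit, newdata = data.frame(speed = grid))

# Piecewise-cubic function passing through the loess predictions
f <- splinefun(grid, pred, method = "natural")

f(10)             # smoothed value at speed = 10
f(10, deriv = 1)  # slope of the smoothed line at speed = 10

plot(cars$speed, cars$dist)
lines(grid, f(grid), col = "blue")
```

The function f is piecewise cubic rather than a single formula, which is about as close to "the mathematical function" of a loess line as one can get.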
Rule Number One: not all distributions have a (closed-form) function which generates them. Yes, you can create a close fit by way of splines, or calculating moments (mean, variance, skew, etc) and building the series, so your choice depends on whether you intend to interpolate, extrapolate, or just "view" the resultant function.
In the scientific world, it's more common to have a theory, or premise, about the behavior behind your data. You can then use standard fitting methods (e.g. nls) to see how well the proposed fit function can be made to match your data.
To understand how the loess line is computed see the loess.demo function in the TeachingDemos package. This is an interactive graphical demonstration that will show how the y-value at each point is computed for each x-value based on the data and bandwidth parameter (it also shows the difference in the raw loess fit and the spline that is often fit to the loess estimates).
