Unable to produce values of a kernel density estimator in R

I'm simulating random numbers from the exponential distribution with rate=1. I have plotted a kernel density of the data using the density() function in R. What I want is a function f that gives me the value of the density at any point. I have tried the following code:
n=10^5
x=rexp(n,rate=1)
d=density(x,kernel="gaussian")
f=function(x){d$y[x]}
f(1)
plot(d)
However, f(1) clearly does not match the value of the density function at the point x = 1. Where am I going wrong?

density produces a list which contains, in field x, the coordinates of the points at which the density is evaluated and, in field y, the estimated values of the density at these coordinates:
> str(d)
List of 7
$ x : num [1:512] -0.348 -0.328 -0.307 -0.286 -0.266 ...
$ y : num [1:512] 0.00146 0.00256 0.00435 0.00717 0.01147 ...
......
To get a function from x and y, you can use the approxfun function:
> f <- approxfun(d$x, d$y)
> f(1)
[1] 0.3665273
> dexp(1, rate=1)
[1] 0.3678794
In your code, f(1) returns d$y[1], which is simply the first value of d$y, not the density at x = 1.
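For comparison, here is what the original f() was actually returning (the numbers come from the str(d) output above and will vary with the simulation):
d$x[1]  # about -0.348: the leftmost evaluation point, not x = 1
d$y[1]  # about 0.00146: the estimated density at d$x[1]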

Related

What exactly does the rexp() function in R do?

I am trying to generate random values from an Exp(0.5) distribution.
I have the following code:
y <- rexp(10, rate=1/2)
y
This gives me:
[1] 4.5582556 2.3285161 4.2466828 0.9995597 3.6326827 0.1016917 0.2518124
[8] 0.3189424 0.8553740 0.8277078
The problem I have here is that I don't know what these values mean. They can't be values of the density function of Exp(1/2), which is f(x) = (1/2) e^(-x/2) for x >= 0 and 0 for x < 0, because that density never exceeds 1/2, yet the output contains values like 4.55.
What do these values mean?
For people coming to R from statistics, the function-name prefixes mean, for any distribution (a short example follows):
d - PDF (density)
p - CDF
q - inverse CDF (quantile function)
r - random sampling from the distribution
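A minimal sketch with the Exp(1/2) distribution from the question, showing what each prefix returns:
dexp(1, rate=1/2)      # density (PDF) at x = 1: 0.5*exp(-0.5), about 0.303
pexp(1, rate=1/2)      # CDF, P(X <= 1) = 1 - exp(-0.5), about 0.393
qexp(0.393, rate=1/2)  # inverse CDF: the x with P(X <= x) = 0.393, about 1
rexp(10, rate=1/2)     # ten random draws, like the values printed above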

Any suggestions for how I can plot mixEM type data using ggplot2

I have a sample of 1M records obtained from my original data. (For reference, you may use this dummy data, which generates an approximately similar distribution:
b <- data.frame(matrix(rnorm(2000000, mean=c(8,17), sd=2)))
c <- b[sample(nrow(b), 1000000), ]
)
I believed the histogram to be a mixture of two log-normal distributions, so I tried to fit the summed distributions with the EM algorithm using the following code:
install.packages("mixtools")
library(mixtools)
#line below returns EM output of type mixEM[] for mixture of normal distributions
c1 <- normalmixEM(c, lambda=NULL, mu=NULL, sigma=NULL)
plot(c1, density=TRUE)
The first plot is a log-likelihood plot, and the second (if you hit return again) gives density curves similar to the following:
As I mentioned, c1 is of type mixEM[] and the plot() function can accommodate that. I want to fill the density curves with colors. This is easy to do with ggplot2, but ggplot2 does not support data of type mixEM[] and throws this message:
ggplot doesn't know how to deal with data of class mixEM
Is there any other approach I can take for this problem?
Look at the structure of the returned object (this should be documented in the help):
> # simple mixture of normals:
> x=c(rnorm(10000,8,2),rnorm(10000,17,4))
> xMix = normalmixEM(x, lambda=NULL, mu=NULL, sigma=NULL)
Now what:
> str(xMix)
List of 9
$ x : num [1:20000] 6.18 9.92 9.07 8.84 9.93 ...
$ lambda : num [1:2] 0.502 0.498
$ mu : num [1:2] 7.99 17.05
$ sigma : num [1:2] 2.03 4.02
$ loglik : num -59877
The lambda, mu, and sigma components define the returned normal densities. You can plot these in ggplot using qplot and stat_function. But first make a function that returns scaled normal densities:
sdnorm <- function(x, mean = 0, sd = 1, lambda = 1) {
  lambda * dnorm(x, mean = mean, sd = sd)
}
Then:
qplot(x, geom = "density") +
  stat_function(fun = sdnorm, geom = "polygon", fill = "blue",
                args = list(mean = xMix$mu[1], sd = xMix$sigma[1], lambda = xMix$lambda[1])) +
  stat_function(fun = sdnorm, geom = "polygon", fill = "#FF0000",
                args = list(mean = xMix$mu[2], sd = xMix$sigma[2], lambda = xMix$lambda[2]))
Or whatever ggplot skills you have. Transparent colours on the densities might be nice.
ggplot(data.frame(x=x)) +
geom_histogram(aes(x=x,y=..density..),fill="white",color="black") +
stat_function(fun=sdnorm,
args=list(mean=xMix$mu[2],
sd=xMix$sigma[2],
lambda=xMix$lambda[2]),
fill="#FF000080",geom="polygon") +
stat_function(fun=sdnorm,
args=list(mean=xMix$mu[1],
sd=xMix$sigma[1],
lambda=xMix$lambda[1]),
fill="#00FF0080",geom="polygon")
producing:
Here's a slightly different approach which uses geom_polygon(...) instead of multiple calls to stat_function(...). One problem with stat_function(...) is that the secondary arguments (mu, sigma, and lambda in this example), which are passed using the args=list(...) parameter, cannot be included in an aesthetic mapping, so you have to make multiple calls to stat_function(...), as in @Spacedman's solution.
This approach builds the PDFs outside of ggplot and uses a single call to geom_polygon(...). As a result, it works without modification for an arbitrary number of distributions in the mixture.
# ggplot mixture plot
gg.mixEM <- function(EM) {
  require(ggplot2)
  x       <- with(EM, seq(min(x), max(x), len = 1000))
  pars    <- with(EM, data.frame(comp = colnames(posterior), mu, sigma, lambda))
  em.df   <- data.frame(x = rep(x, each = nrow(pars)), pars)
  em.df$y <- with(em.df, lambda * dnorm(x, mean = mu, sd = sigma))
  ggplot(data.frame(x = EM$x), aes(x, y = ..density..)) +
    geom_histogram(fill = NA, color = "black") +
    geom_polygon(data = em.df, aes(x, y, fill = comp), color = "grey50", alpha = 0.5) +
    scale_fill_discrete("Component\nMeans", labels = format(em.df$mu, digits = 3)) +
    theme_bw()
}
library(mixtools)
# two components
set.seed(1) # for reproducible example
b <- rnorm(2000000, mean=c(8,17), sd=2)
c <- b[sample(length(b), 1000000) ]
c2 <- normalmixEM(c, lambda=NULL, mu=NULL, sigma=NULL)
gg.mixEM(c2)
# three components
set.seed(1)
b <- rnorm(2000000, mean=c(8,17,30), sd=c(2,3,5))
c <- b[sample(length(b), 1000000) ]
library(mixtools)
c3 <- normalmixEM(c, k=3, lambda=NULL, mu=NULL, sigma=NULL)
gg.mixEM(c3)

Obtain spline surface in R

How do I generate a b-spline surface, let's say:
x=attitude$rating
y=attitude$complaints
z=attitude$privileges
where x and y would be used for the spline basis and z is the set of control points.
If I understand you correctly, you have x, y, and z data and you want to use bivariate spline interpolation on x and y, using z for the control points. You can do this with interp(...) in the akima package.
library(akima)
spline <- interp(x,y,z,linear=FALSE)
# rotatable 3D plot of points and spline surface
library(rgl)
open3d(scale=c(1/diff(range(x)),1/diff(range(y)),1/diff(range(z))))
with(spline,surface3d(x,y,z,alpha=.2))
points3d(x,y,z)
title3d(xlab="rating",ylab="complaints",zlab="privileges")
axes3d()
The plot itself is fairly uninteresting with your dataset because x, y, and z are highly correlated.
EDIT (in response to OP's comment):
If you want a b-spline surface, try out mba.surf(...) in the unfortunately named MBA package.
library(MBA)
spline <- mba.surf(data.frame(x,y,z),100,100)
library(rgl)
open3d(scale=c(1/diff(range(x)),1/diff(range(y)),1/diff(range(z))))
with(spline$xyz,surface3d(x,y,z,alpha=.2))
points3d(x,y,z)
title3d(xlab="rating",ylab="complaints",zlab="privileges")
axes3d()
require(rms) # Harrell's gift to the R world.
# Better to keep the original names and do so within a dataframe.
att <- attitude[c('rating','complaints','privileges')]
add <- datadist(att) # records ranges and descriptive info on data
options(datadist="add") # need these for the rms functions
# rms-`ols` function (ordinary least squares) is a version of `lm`
mdl <- ols( privileges ~ rcs(rating,4)*rcs(complaints,4) ,data=att)
# Predict is an rms function that works with rms's particular classes
pred <- Predict(mdl, 'rating','complaints')
# bplot calls lattice functions; levelplot by default; this gives a "3d" plot
bplot(pred, yhat~rating+complaints, lfun=wireframe)
It's a crossed restricted-cubic spline model. If you have a favorite spline function you want to use instead, then by all means try it out. I've had good luck with the rcs function.
This gives a more open mesh with fewer calculated points:
pred <- Predict(mdl, 'rating','complaints', np=25)
bplot(pred, yhat~rating+complaints, lfun=wireframe)
png()
bplot(pred, yhat~rating+complaints, lfun=wireframe)
dev.off()
You could use the rgl methods illustrated by @jhoward. The top of str(pred) looks like this:
str(pred)
Classes ‘Predict’ and 'data.frame': 625 obs. of 5 variables:
$ rating : num 43 44.6 46.2 47.8 49.4 ...
$ complaints: num 45 45 45 45 45 ...
$ yhat : num 39.9 39.5 39.1 38.7 38.3 ...
$ lower : num 28 28.3 27.3 25 22 ...
$ upper : num 51.7 50.6 50.9 52.4 54.6 ...
snipped
library(rgl)
open3d()
with(pred, surface3d(unique(rating),unique(complaints),yhat,alpha=.2))
with(att, points3d(rating,complaints,privileges, col="red"))
title3d(xlab="rating",ylab="complaints",zlab="privileges")
axes3d()
aspect3d(x=1,z=.05)
This is a good illustration of the dangers of extrapolation: there are no data out at the extremes, so the surface there is an inappropriate extrapolation from the model. The rms package has a perimeter function, and the plotting functions have a perim argument to which perimeter objects are passed.
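As a hedged sketch (assuming rms's perimeter() function and the perim argument of bplot() behave as documented), the surface can be restricted to the region actually covered by the data, reusing att, mdl, and pred from the code above:
# perimeter() records the joint region covered by the two predictors;
# passing it via perim= blanks out the extrapolated corners of the surface
peri <- with(att, perimeter(rating, complaints))
bplot(pred, yhat~rating+complaints, lfun=wireframe, perim=peri)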

How to solve multi-dimensional integral equations with a variable on the upper bound

I would like to solve the equation below, where X is the only unknown variable and the function f() is a multivariate Student t density.
More precisely, I have a k-dimensional integral of a Student t density, which gives a probability as a result, and I know that this probability equals q. The lower bound of every integral is -Inf, and the upper bounds of the last k-1 dimensions are given; the only unknown is the upper bound of the first integral. With one equation and one unknown, it should have a solution. I tried to solve it in R: I fitted a Dynamic Conditional Correlation model to obtain a correlation matrix for my t distribution, plugged that matrix into the multivariate t density dmvt, and used the adaptIntegrate function from the cubature package to build a function passed to uniroot to solve for the upper bound of the first integral. But I am having difficulty getting what I want (I hope my question is clear). I have provided my code below; somebody told me there is a problem with it, but I cannot see what the issue is. Many thanks in advance for your help.
I know how to deal with a one-dimensional integral, but how can a multi-dimensional integral equation be solved in R (e.g. in the 2-dimensional case)?
\int_{-\infty}^{X} \int_{-\infty}^{Y_1} \cdots \int_{-\infty}^{Y_k} f(x, y_1, \ldots, y_k) \, dx \, dy_1 \cdots dy_k = q
This code fails:
require(cubature)
require(mvtnorm)
corr <- matrix(c(1,0.8,0.8,1),2,2)
f <- function(x){ dmvt(x,sigma=corr,df=3) }
g <- function(y) adaptIntegrate(f,
lowerLimit = c( -Inf, -Inf),
upperLimit = c(y, -0.1023071))$integral-0.0001
uniroot( g, c(-2, 2))
Since mvtnorm includes a pmvt function that computes the CDF of the multivariate t distribution, you don't need to do the integral by brute force. (mvtnorm also includes a quantile function qmvt, but only for "equicoordinate" values.)
So:
library(mvtnorm)
g <- function(y1_upr,y2_upr=-0.123071,target=1e-4,df=3) {
pmvt(upper=c(y1_upr,y2_upr),df=df)-target
}
uniroot(g,c(-10000,0))
## $root
## [1] -17.55139
##
## $f.root
## [1] -1.699876e-11
## attr(,"error")
## [1] 1e-15
## attr(,"msg")
## [1] "Normal Completion"
##
## $iter
## [1] 18
##
## $estim.prec
## [1] 6.103516e-05
##
Double-check:
pmvt(upper=c(-17.55139,-0.123071),df=3)
## [1] 1e-04
## attr(,"error")
## [1] 1e-15
## attr(,"msg")
## [1] "Normal Completion"

Compute projection / hat matrix via QR factorization, SVD (and Cholesky factorization?)

I'm trying to calculate in R a projection matrix P of an arbitrary N x J matrix S:
P = S(S'S)^{-1}S'
I've been trying to perform this with the following function:
P <- function(S){
output <- S %*% solve(t(S) %*% S) %*% t(S)
return(output)
}
But when I use this I get errors that look like this:
# Error in solve.default(t(S) %*% S, t(S), tol = 1e-07) :
# system is computationally singular: reciprocal condition number = 2.26005e-28
I think this is a result of numerical underflow and/or instability, as discussed in numerous places like r-help and here, but I'm not experienced enough with SVD or QR decomposition to fix the problem or to put the existing suggestions into action. I've also tried the suggested code, which writes solve as a system:
output <- S %*% solve (t(S) %*% S, t(S), tol=1e-7)
But still it doesn't work. Any suggestions would be appreciated.
I'm pretty sure my matrix should be invertible and does not have any collinearities, if only because I have tried testing this with a matrix of orthogonal dummy variables, and it still doesn't work.
Also, I'd like to apply this to fairly large matrices, so I'm looking for a neat general solution.
Although the OP has not been active for more than a year, I have still decided to post an answer. I will use X instead of S because, in statistics, we usually want the projection matrix in a linear regression context, where X is the model matrix, y is the response vector, and H = X(X'X)^{-1}X' is the hat / projection matrix, so that Hy gives the fitted values.
This answer assumes the context of ordinary least squares. For weighted least squares, see Get hat matrix from QR decomposition for weighted least square regression.
An overview
solve is based on the LU factorization of a general square matrix. For X'X (which should be computed with crossprod(X) rather than t(X) %*% X in R; read ?crossprod for more), which is symmetric, we can use chol2inv, which is based on the Cholesky factorization.
However, triangular factorization is less stable than QR factorization. This is not hard to understand: if X has condition number kappa, X'X will have condition number kappa ^ 2. This can cause big numerical difficulty. The error message you get:
# system is computationally singular: reciprocal condition number = 2.26005e-28
is just telling you this: the reciprocal condition number, 1 / kappa ^ 2, is about 1e-28, far below the machine precision of around 1e-16. With the tolerance tol = .Machine$double.eps, X'X will be seen as rank deficient, so LU and Cholesky factorization will break down.
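A quick, hedged illustration of this squaring effect on a toy nearly-collinear matrix (the names Xbad and u are illustrative, not from the question):
set.seed(42)
u    <- rnorm(100)
Xbad <- cbind(u, u + 1e-10 * rnorm(100))  # two nearly collinear columns
kappa(Xbad)                   # condition number of Xbad is already very large
try(solve(crossprod(Xbad)))   # Xbad'Xbad is far worse; solve() reports it as
                              # computationally singular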
Generally, we switch to SVD or QR in this situation, but pivoted Cholesky factorization is another choice.
SVD is the most stable method, but too expensive;
QR is satisfactorily stable, at moderate computational cost, and is commonly used in practice;
Pivoted Cholesky is fast, with acceptable stability; for large matrices it is the preferred choice.
In the following, I will explain all three methods.
Using QR factorization
Note that the projection matrix is permutation independent, i.e., it does not matter whether we perform QR factorization with or without pivoting.
In R, qr.default can call LINPACK routine DQRDC for non-pivoted QR factorization, and LAPACK routine DGEQP3 for block pivoted QR factorization. Let's generate a toy matrix and test both options:
set.seed(0); X <- matrix(rnorm(50), 10, 5)
qr_linpack <- qr.default(X)
qr_lapack <- qr.default(X, LAPACK = TRUE)
str(qr_linpack)
# List of 4
# $ qr : num [1:10, 1:5] -3.79 -0.0861 0.3509 0.3357 0.1094 ...
# $ rank : int 5
# $ qraux: num [1:5] 1.33 1.37 1.03 1.01 1.15
# $ pivot: int [1:5] 1 2 3 4 5
# - attr(*, "class")= chr "qr"
str(qr_lapack)
# List of 4
# $ qr : num [1:10, 1:5] -3.79 -0.0646 0.2632 0.2518 0.0821 ...
# $ rank : int 5
# $ qraux: num [1:5] 1.33 1.21 1.56 1.36 1.09
# $ pivot: int [1:5] 1 5 2 4 3
# - attr(*, "useLAPACK")= logi TRUE
# - attr(*, "class")= chr "qr"
Note the $pivot is different for two objects.
Now, we define a wrapper function to compute QQ':
f <- function (QR) {
## thin Q-factor
Q <- qr.qy(QR, diag(1, nrow = nrow(QR$qr), ncol = QR$rank))
## QQ'
tcrossprod(Q)
}
We will see that qr_linpack and qr_lapack give the same projection matrix:
H1 <- f(qr_linpack)
H2 <- f(qr_lapack)
mean(abs(H1 - H2))
# [1] 9.530571e-17
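As an optional sanity check (valid here only because the small toy X is well conditioned), the QR result also matches the textbook formula X(X'X)^{-1}X' computed directly:
H0 <- X %*% solve(crossprod(X), t(X))
mean(abs(H0 - H1))
# expect a value on the order of machine precision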
Using singular value decomposition
In R, svd computes singular value decomposition. We still use the above example matrix X:
SVD <- svd(X)
str(SVD)
# List of 3
# $ d: num [1:5] 4.321 3.667 2.158 1.904 0.876
# $ u: num [1:10, 1:5] -0.4108 -0.0646 -0.2643 -0.1734 0.1007 ...
# $ v: num [1:5, 1:5] -0.766 0.164 0.176 0.383 -0.457 ...
H3 <- tcrossprod(SVD$u)
mean(abs(H1 - H3))
# [1] 1.311668e-16
Again, we get the same projection matrix.
Using Pivoted Cholesky factorization
For demonstration, we still use the example X above.
## pivoted Chol for `X'X`; we want lower triangular factor `L = R'`:
## we also suppress possible rank-deficient warnings (no harm at all!)
L <- t(suppressWarnings(chol(crossprod(X), pivot = TRUE)))
str(L)
# num [1:5, 1:5] 3.79 0.552 -0.82 -1.179 -0.182 ...
# - attr(*, "pivot")= int [1:5] 1 5 2 4 3
# - attr(*, "rank")= int 5
## compute `Q'`
r <- attr(L, "rank")
piv <- attr(L, "pivot")
Qt <- forwardsolve(L, t(X[, piv]), r)
## P = QQ'
H4 <- crossprod(Qt)
## compare
mean(abs(H1 - H4))
# [1] 6.983997e-17
Again, we get the same projection matrix.
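As a final hedged check that applies to any of H1 to H4: a projection onto the column space of X must reproduce X and be idempotent, so both quantities below should be at the level of machine precision:
mean(abs(H1 %*% X - X))    # H X = X
mean(abs(H1 %*% H1 - H1))  # H H = H (idempotency)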
