This is my first post to the R-community, so pardon me if it is silly. I would like to use the functions geom_density2d and stat_density2d in ggplot2 to plot kernel density estimates, but the problem is that they can't handle weighted data. From what I understand, these two functions call the function kde2d from package MASS to make the kernel density estimate. And the kde2d doesn't take data weights as a parameter.
Now, I have found this altered version of kde2d http://www.inside-r.org/node/226757, which takes weights as a parameter and is based on the source code of kde2d. The code of this function:
kde2d.weighted <- function (x, y, w, h, n = 25, lims = c(range(x), range(y))) {
  nx <- length(x)
  if (length(y) != nx)
    stop("data vectors must be the same length")
  if (missing(w))
    w <- numeric(nx) + 1 # default: equal weights (checked before w is first used)
  if (length(w) != nx && length(w) != 1)
    stop("weight vectors must be 1 or length of data")
  gx <- seq(lims[1], lims[2], length = n) # gridpoints x
  gy <- seq(lims[3], lims[4], length = n) # gridpoints y
  if (missing(h))
    h <- c(bandwidth.nrd(x), bandwidth.nrd(y))
  h <- h/4
  ax <- outer(gx, x, "-")/h[1] # distance of each point to each grid point in x-direction
  ay <- outer(gy, y, "-")/h[2] # distance of each point to each grid point in y-direction
  z <- (matrix(rep(w,n), nrow=n, ncol=nx, byrow=TRUE)*matrix(dnorm(ax), n, nx)) %*% t(matrix(dnorm(ay), n, nx))/(sum(w) * h[1] * h[2]) # z is the density
  return(list(x = gx, y = gy, z = z))
}
I would like to make the functions geom_density2d and stat_density2d call kde2d.weighted instead of kde2d, thereby making them accept weighted data.
I have never changed any functions in existing R packages, so my question is: what is the easiest way to do this?
You can actually pass your own density data to geom_contour which would probably be the easiest. Let's start with a sample dataset by adding weights to the geyser data.
library("MASS")
library("ggplot2")
data(geyser, package = "MASS")
geyserw <- transform(geyser,
  weight = sample(1:5, nrow(geyser), replace = TRUE)
)
Now we use your weighted function to calculate the density and turn it into a data.frame
dens <- kde2d.weighted(geyserw$duration, geyserw$waiting, geyserw$weight)
dfdens <- data.frame(expand.grid(x=dens$x, y=dens$y), z=as.vector(dens$z))
Now we plot the data
ggplot(geyserw, aes(x = duration, y = waiting)) +
geom_point() + xlim(0.5, 6) + ylim(40, 110) +
geom_contour(aes(x=x, y=y, z=z), data= dfdens)
And that should do it
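As a quick sanity check on the weighting itself, here is a minimal sketch in base R. It assumes a simplified helper, wkde2d (a made-up name, not part of MASS or ggplot2), which takes the bandwidth h directly as the Gaussian standard deviation instead of applying the h/4 rescaling used above. With integer weights, the weighted estimate should match the unweighted estimate computed on data where each point is repeated w times:

```r
# Hypothetical helper: weighted 2-D Gaussian KDE on an n x n grid.
# h is used directly as the kernel standard deviation in each direction.
wkde2d <- function(x, y, w = rep(1, length(x)), h, n = 25,
                   lims = c(range(x), range(y))) {
  gx <- seq(lims[1], lims[2], length.out = n)
  gy <- seq(lims[3], lims[4], length.out = n)
  kx <- dnorm(outer(gx, x, "-") / h[1])  # n x length(x)
  ky <- dnorm(outer(gy, y, "-") / h[2])  # n x length(y)
  # w * t(ky) scales row i of t(ky) by w[i]
  z <- (kx %*% (w * t(ky))) / (sum(w) * h[1] * h[2])
  list(x = gx, y = gy, z = z)
}

set.seed(1)
x <- rnorm(20)
y <- rnorm(20)
w <- sample(1:5, 20, replace = TRUE)
lims <- c(-3, 3, -3, 3)

weighted   <- wkde2d(x, y, w, h = c(0.5, 0.5), lims = lims)
replicated <- wkde2d(rep(x, w), rep(y, w), h = c(0.5, 0.5), lims = lims)

# Should report TRUE if the weights behave like repeated observations
all.equal(weighted$z, replicated$z)
```

The same kind of check could be run against kde2d.weighted before trusting the contours it produces.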
I would like to plot a sphere in R with gridlines on the surface corresponding to an equal-area gridding of the sphere, using the arccos transformation.
I have been experimenting with the R package rgl and got some help from:
Plot points on a sphere in R
which plots the gridlines with equal lat-long spacing.
I have the function below, which returns a data frame of the points where the grid lines I want cross, but I am not sure how to proceed.
plot_sphere <- function(theta_num,phi_num){
theta <- seq(0,2*pi,(2*pi)/(theta_num))
phi <- seq(0,pi,pi/(phi_num))
tmp <- seq(0,2*phi_num,2)/phi_num
phi <- acos(1-tmp)
tmp <- cbind(rep(seq(1,theta_num),each = phi_num),rep(seq(1,phi_num),times = theta_num))
results <- as.data.frame(cbind(theta[tmp[,1]],phi[tmp[,2]]))
names(results) <- c("theta","phi")
results$x <- cos(results$theta)*sin(results$phi)
results$y <- sin(results$theta)*sin(results$phi)
results$z <- cos(results$phi)
return(results)
}
sphere <- plot_sphere(10,10)
Can anyone help? In general, I am finding the rgl functions tricky to work with.
If you use lines3d or plot3d(..., type="l"), you'll get a plot joining the points in your dataframe. To get breaks (you don't want one long line), add rows containing NA values.
The code in your plot_sphere function seems really messed up (you compute phi twice, you don't generate vectors of the requested length, etc.), but this function based on it works:
library(rgl)
plot_sphere <- function(theta_num, phi_num){
  theta0 <- seq(0, 2*pi, length.out = theta_num)
  tmp <- seq(0, 2, length.out = phi_num)
  phi0 <- acos(1 - tmp)
  # meridians: index phi_num + 1 is out of range, giving an NA row after each line
  i <- seq(1, (phi_num + 1)*theta_num) - 1
  theta <- theta0[i %/% (phi_num + 1) + 1]
  phi <- phi0[i %% (phi_num + 1) + 1]
  # parallels: same trick, with the NA break after each circle
  i <- seq(1, phi_num*(theta_num + 1)) - 1
  theta <- c(theta, theta0[i %% (theta_num + 1) + 1])
  phi <- c(phi, phi0[i %/% (theta_num + 1) + 1])
  results <- data.frame( x = cos(theta)*sin(phi),
                         y = sin(theta)*sin(phi),
                         z = cos(phi))
  lines3d(results)
}
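To make the NA-break idea easier to see, here is a minimal sketch that only builds the coordinates and leaves the drawing to rgl. The helper names sphere_grid and seg, and the res argument (number of points per line), are made up for this example. Each grid line is one segment followed by a single NA row:

```r
# Hypothetical helper: grid lines on the unit sphere as one data frame,
# with an NA row after every line so that they are drawn separately.
sphere_grid <- function(theta_num, phi_num, res = 50) {
  theta0 <- seq(0, 2 * pi, length.out = theta_num)
  phi0   <- acos(1 - seq(0, 2, length.out = phi_num))  # equal steps in cos(phi)

  # One line: coordinates for theta/phi (one of them scalar), then an NA row
  seg <- function(theta, phi) {
    n <- max(length(theta), length(phi))
    data.frame(x = c(rep_len(cos(theta) * sin(phi), n), NA),
               y = c(rep_len(sin(theta) * sin(phi), n), NA),
               z = c(rep_len(cos(phi), n), NA))
  }

  fine_phi   <- seq(0, pi, length.out = res)
  fine_theta <- seq(0, 2 * pi, length.out = res)
  meridians <- lapply(theta0, function(t) seg(t, fine_phi))
  parallels <- lapply(phi0, function(p) seg(fine_theta, p))
  do.call(rbind, c(meridians, parallels))
}

grid <- sphere_grid(10, 10)
# rgl::lines3d(grid)  # uncomment to draw; requires the rgl package
```

Because the parallels are placed at equal steps in cos(phi), the bands between neighbouring parallels have equal area.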
I am trying to plot the following function in RStudio using the curve function as follows:
loglikelihood.func = function(x, mu){
n = length(x)
n*mu - sum(x) - sum(exp(mu)/(exp(x)))
}
curve(expr = loglikelihood.func(x = data, mu), xname = "mu", from = 0, to = 15)
I have a vector of data that contains 50 data points, and the function currently seems to sum over both mu and x (my data), i.e. sum(exp(mu)/(exp(x))) behaves like sum(exp(mu))/sum(exp(x)).
I want the function to compute the sum of fractions over the data points in x while keeping mu constant, i.e. exp(mu)/exp(x1) + exp(mu)/exp(x2) + exp(mu)/exp(x3) + ..., and to repeat this for each separate mu when the curve function plots it.
If I change my function to exp(mu)/sum(exp(x)), it computes exp(mu)/[exp(x1)+exp(x2)+...], which is not what I want. Can someone offer some advice here?
Edit:
This is a subset of my data,
data = c(8.5,8.9,9.1,8.9,8.4,9.7,9.1,9.6,8.7,9.3,9.6,9.3,8.7,9.0,8.8,8.9,8.9,12.2)
Not sure I get this right... But curve() can't do as much by itself. You can define your function:
loglikelihood.func = function(x, mu) {
length(x) * mu - sum(x) - sum(exp(mu)/(exp(x)))
}
Then define a range (for x) over which you'd like to plot it, and specify the limits (xlim, ylim) on a first plot. If you don't want to have a sequence, you can use your data instead.
xrange <- seq(from=1, to=10, by = 0.1)
plot(x=xrange, y = sapply(xrange, function(x) loglikelihood.func(x, mu=0)),
xlim = c(1, 10),
ylim = c(-10, 0),
type = "l")
Then add other curves, specifying different mu's:
lines(x=xrange, y = sapply(xrange, function(x) loglikelihood.func(x, mu=1)))
lines(x=xrange, y = sapply(xrange, function(x) loglikelihood.func(x, mu=2)))
(A loop is more practical if you don't need extra graphical parameters.)
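If the aim is instead to plot the log-likelihood against mu for the fixed data vector, one possible sketch is to vectorise the function over mu only. curve() passes the whole vector of mu values in a single call, so inside sum() they get recycled against x; vectorising over mu gives each mu value its own sum over all data points. The name ll is made up for this example, and data is the subset posted in the question:

```r
loglikelihood.func <- function(x, mu) {
  length(x) * mu - sum(x) - sum(exp(mu) / exp(x))
}

data <- c(8.5, 8.9, 9.1, 8.9, 8.4, 9.7, 9.1, 9.6, 8.7, 9.3,
          9.6, 9.3, 8.7, 9.0, 8.8, 8.9, 8.9, 12.2)

# Vectorise over mu only: x stays the full data vector for every mu
ll <- Vectorize(loglikelihood.func, vectorize.args = "mu")

curve(ll(x = data, mu), xname = "mu", from = 0, to = 15)
```

An equivalent without Vectorize would be sapply(mu, function(m) loglikelihood.func(data, m)) inside the curve() expression.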
The task:
Eric the fly has a friend, Ernie. Assume that the two flies sit at independent locations, uniformly distributed on the globe’s surface. Let D denote the Euclidean distance between Eric and Ernie (i.e., on a straight line through the interior of the globe).
Make a conjecture about the probability density function of D and give an estimate of its expected value, E(D).
So far I have made a function to generate two points on the globe's surface, but I am unsure what to do next:
sample3d <- function(n)
{
df <- data.frame()
while(n > 0){
x <- runif(1,-1,1)
y <- runif(1,-1,1)
z <- runif(1,-1,1)
r <- x^2 + y^2 + z^2
if (r < 1){
u <- sqrt(x^2+y^2+z^2)
vector = data.frame(x = x/u,y = y/u, z = z/u)
df <- rbind(vector,df)
n = n- 1
}
}
df
}
E <- sample3d(2)
This is an interesting problem. I'll outline a computational approach; I'll leave the math up to you.
First we fix a random seed for reproducibility.
set.seed(2018);
We sample 10^4 points from the unit sphere surface.
sample3d <- function(n = 100) {
df <- data.frame();
while(n > 0) {
x <- runif(1,-1,1)
y <- runif(1,-1,1)
z <- runif(1,-1,1)
r <- x^2 + y^2 + z^2
if (r < 1) {
u <- sqrt(x^2 + y^2 + z^2)
vector = data.frame(x = x/u,y = y/u, z = z/u)
df <- rbind(vector,df)
n = n- 1
}
}
df
}
df <- sample3d(10^4);
Note that sample3d is not very efficient, but that's a different issue.
We now randomly sample 2 points from df, calculate the Euclidean distance between those two points (using dist), and repeat this procedure N = 10^4 times.
# Sample 2 points randomly from df, repeat N times
N <- 10^4;
dist <- replicate(N, dist(df[sample(1:nrow(df), 2), ]));
As pointed out by @JosephWood, the number N = 10^4 is somewhat arbitrary. We are using a bootstrap to derive the empirical distribution. For N -> infinity one can show that the empirical bootstrap distribution is the same as the (unknown) population distribution (Bootstrap theorem). The error term between empirical and population distribution is of the order 1/sqrt(N), so N = 10^4 should lead to an error around 1%.
We can plot the resulting probability distribution as a histogram:
# Let's plot the distribution
library(ggplot2);
ggplot(data.frame(x = dist), aes(x)) + geom_histogram(bins = 50);
Finally, we can get empirical estimates for the mean and median.
# Mean
mean(dist);
#[1] 1.333021
# Median
median(dist);
#[1] 1.41602
These values are close to the theoretical values:
mean.th = 4/3
median.th = sqrt(2)
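A faster way to run the same experiment, sketched in base R: normalising three independent standard normal draws also gives a uniform point on the sphere, so both flies can be sampled directly without the rejection loop or the resampling step. The helper name runif_sphere is made up for this example. The theoretical mean 4/3 and median sqrt(2) are what one gets from the density f(d) = d/2 on [0, 2], so that line is overlaid on the histogram as a candidate conjecture:

```r
set.seed(2018)

# Hypothetical helper: n uniform points on the unit sphere,
# obtained by normalising standard normal vectors (one point per row)
runif_sphere <- function(n) {
  m <- matrix(rnorm(3 * n), ncol = 3)
  m / sqrt(rowSums(m^2))
}

n <- 1e5
eric  <- runif_sphere(n)
ernie <- runif_sphere(n)
d <- sqrt(rowSums((eric - ernie)^2))  # chord length for each pair

mean(d)
median(d)

hist(d, breaks = 50, freq = FALSE)
curve(x / 2, from = 0, to = 2, add = TRUE, col = "red", lty = 2)
```

Since every pair is drawn independently here, the usual 1/sqrt(n) Monte Carlo error applies directly to the estimate of the mean.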
I am trying to plot in 3d with wireframe, here is my code:
mecdf = function (nr, nc, x, u) # mecdf is the bivariate cumulative empiric function
{ k = rep (TRUE, nr)
for (j in 1:nc) k = k & (x [,j] <= u [j])
sum (k) / nr
}
xc = round(runif(100), 2)
yc = round(runif(100), 2)
Da = cbind(xc, yc)
bcdfa<-rep(NA,100)
for (i in 1:100) {bcdfa[i]=mecdf(nrow(Da),ncol(Da),Da,Da[i,])}
bcdfa
x<-cbind(xc,yc,bcdfa) # bcdfa is the value of the bcdfa in every (xc,yc) point
...........
Is it possible to use wireframe or another function with these data to represent the bivariate cumulative distribution function?
Here's one way.
library(akima)
library(plot3D)
bcfoo<-interp(xc,yc,bcdfa)
persp3D(bcfoo$x,bcfoo$y,bcfoo$z)
#plot points to verify overall shape
scatter3D(xc,yc,bcdfa)
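Another option, sketched with base graphics only, is to evaluate the empirical CDF on a regular grid rather than interpolating between the scattered points. The grid size of 30 and the name Fhat are arbitrary choices for this example, and the data are regenerated here so the snippet runs on its own:

```r
set.seed(1)
xc <- round(runif(100), 2)
yc <- round(runif(100), 2)

gx <- seq(0, 1, length.out = 30)
gy <- seq(0, 1, length.out = 30)

# Fhat[i, j] = proportion of points with xc <= gx[i] and yc <= gy[j]
Fhat <- outer(gx, gy, Vectorize(function(u, v) mean(xc <= u & yc <= v)))

persp(gx, gy, Fhat, theta = -30, phi = 25, ticktype = "detailed",
      xlab = "x", ylab = "y", zlab = "F(x, y)")
```

A matrix on a regular grid like Fhat is also the kind of input that surface-plotting functions generally expect.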
I would like to compute the convolution of two probability distributions in R and I need some help. For the sake of simplicity, let's say I have a variable x that is normally distributed with mean = 1.0 and stdev = 0.5, and y that is log-normally distributed with mean = 1.5 and stdev = 0.75. I want to determine z = x + y. I understand that the distribution of z is not known a priori.
As an aside, the real-world example I am working with requires adding two random variables that follow a number of different distributions.
Does anyone know how to add two random variables by convolving the probability density functions of x and y?
I have tried generating n normally distributed random values (with above parameters) and adding them to n log-normally distributed random values. However, I wish to know if I can use the convolution method instead. Any help would be greatly appreciated.
EDIT
Thank you for these answers. I define a pdf, and try to do the convolution integral, but R complains on the integration step. My pdfs are Log Pearson 3 and are as follows
dlp3 <- function(x, a, b, g) {
p1 <- 1/(x*abs(b) * gamma(a))
p2 <- ((log(x)-g)/b)^(a-1)
p3 <- exp(-1* (log(x)-g) / b)
d <- p1 * p2 * p3
return(d)
}
f.m <- function(x) dlp3(x,3.2594,-0.18218,0.53441)
f.s <- function(x) dlp3(x,9.5645,-0.07676,1.184)
f.t <- function(z) integrate(function(x,z) f.s(z-x)*f.m(x),-Inf,Inf,z)$value
f.t <- Vectorize(f.t)
integrate(f.t, lower = 0, upper = 3.6)
R complains at the last step since the f.t function is bounded and my integration limits are probably not correct. Any ideas on how to solve this?
Here is one way.
f.X <- function(x) dnorm(x,1,0.5) # normal (mu=1, sigma=0.5)
f.Y <- function(y) dlnorm(y,1.5, 0.75) # log-normal (meanlog=1.5, sdlog=0.75)
# convolution integral
f.Z <- function(z) integrate(function(x,z) f.Y(z-x)*f.X(x),-Inf,Inf,z)$value
f.Z <- Vectorize(f.Z) # need to vectorize the resulting fn.
set.seed(1) # for reproducible example
X <- rnorm(1000,1,0.5)
Y <- rlnorm(1000,1.5,0.75)
Z <- X + Y
# compare the methods
hist(Z,freq=F,breaks=50, xlim=c(0,30))
z <- seq(0,50,0.01)
lines(z,f.Z(z),lty=2,col="red")
Same thing using package distr.
library(distr)
N <- Norm(mean=1, sd=0.5) # N is signature for normal dist
L <- Lnorm(meanlog=1.5,sdlog=0.75) # same for log-normal
conv <- convpow(L+N,1) # object of class AbscontDistribution
f.Z <- d(conv) # density function of the convolution
hist(Z,freq=F,breaks=50, xlim=c(0,30))
z <- seq(0,50,0.01)
lines(z,f.Z(z),lty=2,col="red")
I was having trouble getting integrate() to work for different density parameters, so I came up with an alternative to @jlhoward's using a Riemann-sum approximation:
set.seed(1)
#densities to be convolved. could also put these in the function below
d1 <- function(x) dnorm(x,1,0.5) #
d2 <- function(y) dlnorm(y,1.5, 0.75)
#Riemann approximation of convolution
conv <- function(t, a, b, d) { # [a, b] needs to cover the range of the densities above; d needs to be small for an accurate approximation
  x <- seq(a, b, d)
  z <- numeric(length(t))
  for (i in seq_along(t)) {
    z[i] <- sum(d1(x)*d2(t[i]-x)*d)
  }
  return(z)
}
#check against sampled convolution
X <- rnorm(1000, 1, 0.5)
Y <- rlnorm(1000, 1.5, 0.75)
Z <- X + Y
t <- seq(0, 50, 0.05) #range to evaluate t, smaller increment -> smoother curve
hist(Z, breaks = 50, freq = F, xlim = c(0,30))
lines(t, conv(t, -100, 100, 0.1), type = "s", col = "red")
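The same Riemann sum can also be written as a single matrix product instead of a loop. This is only a sketch under the same assumptions, and the names conv_grid and fz are made up here. A rough check is that the resulting density integrates to about 1 over the plotted range:

```r
d1 <- function(x) dnorm(x, 1, 0.5)
d2 <- function(y) dlnorm(y, 1.5, 0.75)

# Riemann-sum convolution: row i of the matrix holds d2(t[i] - x) for
# every grid point x; multiplying by d1(x) * d sums over x
conv_grid <- function(t, a, b, d) {
  x <- seq(a, b, by = d)
  as.vector(outer(t, x, function(t, x) d2(t - x)) %*% (d1(x) * d))
}

t <- seq(0, 50, 0.05)
fz <- conv_grid(t, a = -5, b = 7, d = 0.01)  # [-5, 7] covers d1 well

sum(fz) * 0.05  # approximate probability mass on [0, 50]
```

The matrix has length(t) rows and one column per grid point, so memory use grows quickly as d shrinks or [a, b] widens; the loop version is slower but avoids that.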