Erratic behavior of a density plot - r

I have two numerical variables whose densities I estimated with the density command in R. The code is the following:
d0<-density(T0,n=2^14)
df_density0<-data.frame(x=d0$x,y=d0$y,stringsAsFactors = FALSE)
d1<-density(T1,n=2^14)
df_density1<-data.frame(x=d1$x,y=d1$y,stringsAsFactors = FALSE)
Initially, I had left the number of equally spaced points n at its default value of 512, but then I realized that the area under the density curve d1 was not equal to 1 (it was around 13). I then increased n until the AUC (area under the curve) was close to 1 for both density plots, checking it this way:
library(zoo)
x <- df_density0$x
y <- df_density0$y
id <- order(x)
AUC0 <- sum(diff(x[id])*rollmean(y[id],2))
x <- df_density1$x
y <- df_density1$y
id <- order(x)
AUC1 <- sum(diff(x[id])*rollmean(y[id],2))
For n=2^14 I obtained AUC0 and AUC1 equal to 0.9999... I plotted these density curves and I obtained the following graphic (blue is for T0 and red is for T1):
Since these graphs were indistinguishable from each other, I selected a logarithmic scale for the vertical axis:
Is it possible that I get such a result? Should I change the kernel field in the density function?
Ps. To draw these graphs I exported d0 and d1 in two csv files to import them into LaTeX. Anyway, I obtained the same plots in R. Moreover,
> nrow(T0)
[1] 9760
> nrow(T1)
[1] 1963
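For reference, the same trapezoidal check can be written in base R without zoo. This is only a sketch: sample_values is simulated data standing in for T0, which is not available here, so the number it produces says nothing about the original data.

```r
# Hypothetical stand-in for T0 (the real data are not available)
set.seed(42)
sample_values <- rnorm(2000)

# Kernel density estimate on a fine grid, as in the question
d <- density(sample_values, n = 2^14)

# Trapezoidal rule: interval widths times the mean of neighbouring heights
auc <- sum(diff(d$x) * (head(d$y, -1) + tail(d$y, -1)) / 2)
auc
```

If the value stays far from 1 even on a fine grid, the grid spacing relative to the bandwidth (d$bw) may be worth inspecting.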

Related

Area between the two curves

I have two sets of data, and I have plotted their two probability density functions. Now I want the area between the two probability density functions within a certain x range.
I tried to integrate the area, trapezoidal rule etc:
Calculating the area between a curve and a straight line without finding the function
Error calculating the area between two lines using "integrate"
How to measure area between 2 distribution curves in R / ggplot2
but all were in vain.
Here is the link to the data I am working on.
https://sheet.zoho.com/sheet/editor.do?doc=1ff030ea1af35f06f8303927d7ea62b3c4b04bdae021555e8cc43ed0569cb2aaceb26368f93db4d15ac66cf7662d9a7873e889e1763139a49ffd68e7843e0b44
dens.pre=density(TX/10)
dens.post=density(TX30/10)
plot(dens.pre,col="green")
lines(dens.post,col="red")
locator()
#$x
#[1] 18.36246
#$y
#[1] 0.05632428
abline(v=18.3,col="red")
Finding the area between the two curves for X > 18.3.
Area between the curves:
With trapezoidal rule you could probably calculate it like this:
d0 <- dens.pre
d1 <- dens.post
f0 <- approxfun(d0$x, d0$y)
f1 <- approxfun(d1$x, d1$y)
# defining x range of the density overlap
ovrng <- c(18.3, min(max(d0$x), max(d1$x)))
# dividing it to sections (for example n=500)
i <- seq(min(ovrng), max(ovrng), length.out=500)
# calculating the distance between the density curves
h1 <- f0(i)-f1(i)
h2 <- f1(i)-f0(i)
#and using the formula for the area of a trapezoid we add up the areas
area1<-sum( (h1[-1]+h1[-length(h1)]) /2 *diff(i) *(h1[-1]>=0+0)) # for the regions where d0>=d1 (h1 = f0-f1)
area2<-sum( (h2[-1]+h2[-length(h2)]) /2 *diff(i) *(h2[-1]>=0+0)) # for the regions where d1>=d0 (h2 = f1-f0)
area_total <- area1 + area2
area_total
However, since you are interested only in the area where one curve remains below the other for the whole range, this can be shortened:
d0 <- dens.pre
d1 <- dens.post
f0 <- approxfun(d0$x, d0$y)
f1 <- approxfun(d1$x, d1$y)
# defining x range of the density overlap
ovrng <- c(18.3, min(max(d0$x), max(d1$x)))
# dividing it to sections (for example n=500)
i <- seq(min(ovrng), max(ovrng), length.out=500)
# calculating the distance between the density curves
h1 <- f1(i)-f0(i)
#and using the formula for the area of a trapezoid we add up the areas where d1>d0
area<-sum( (h1[-1]+h1[-length(h1)]) /2 *diff(i) *(h1[-1]>=0+0))
area
#We can plot the region using
plot(d0, main="d0=black, d1=green")
lines(d1, col="green")
jj<-which(h1>0 & seq_along(h1) %% 5==0); j<-i[jj];
segments(j, f1(j), j, f1(j)-h1[jj])
There are other (and more detailed) solutions here and here
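The two masked sums can also be collapsed into a single trapezoidal sum of the absolute difference between the curves. The sketch below is an assumption-laden variant, not the code above: pre and post are simulated samples standing in for TX/10 and TX30/10, and area_between is a hypothetical helper name.

```r
# Simulated stand-ins for TX/10 and TX30/10 (the real data are not available)
set.seed(1)
pre  <- rnorm(500, mean = 15, sd = 3)
post <- rnorm(500, mean = 17, sd = 3)

# Hypothetical helper: area between two density estimates on [from, to]
area_between <- function(d0, d1, from, to, n = 2001) {
  # Linear interpolators that return 0 outside each density's own x range
  f0 <- approxfun(d0$x, d0$y, yleft = 0, yright = 0)
  f1 <- approxfun(d1$x, d1$y, yleft = 0, yright = 0)
  grid <- seq(from, to, length.out = n)
  gap <- abs(f1(grid) - f0(grid))
  # Trapezoidal rule on the absolute gap between the curves
  sum((gap[-1] + gap[-length(gap)]) / 2 * diff(grid))
}

d0 <- density(pre)
d1 <- density(post)
area <- area_between(d0, d1, from = 18.3, to = min(max(d0$x), max(d1$x)))
area
```

Taking the absolute value means crossings inside the range need no special handling; the price is that the result no longer says which curve is on top.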

Subset 3D matrix using polygon coordinates

I'm working on some bioacoustical analysis and got stuck with an issue that I believe can be worked out mathematically. I'll use a sound sample from the seewave package:
library(seewave)
library(tuneR)
data(tico)
By storing a spectrogram (i.e. graphic representation of the sound wave tico) in an R object, we can now deal with the wave file computationally.
s <- spectro(tico, plot=F)
class(s)
>[1] "list"
length(s)
>[1] 3
The created object s consists of two numerical vectors, x = s$time and y = s$freq, representing the X and Y axes, respectively, and a matrix z = s$amp of amplitude values whose dimensions correspond to x and y. z is virtually a 3D matrix that can be plotted using persp3D (plot3D), plot_ly (plotly) or plot3d (rgl). Alternatively, the wave file can be plotted in 3D using seewave if one wishes to visualize it as an interactive rgl plot.
spectro3D(tico)
That being said, the analysis I'm conducting aims to calculate contours of relative amplitude:
con <- contourLines(x=s$time, y=s$freq, z=t(s$amp), levels=seq(-25, -25, 1))
Select the longest contour:
n.con <- numeric(length(con))
for(i in 1:length(con)) n.con[i] <- length(con[[i]]$x)
n.max <- which.max(n.con)
con.max <- con[[n.max]]
And then plot the selected contour against the spectrogram of tico:
spectro(tico, grid=F, osc=F, scale=F)
polygon(x=con.max$x, y=con.max$y, lwd=2)
Now comes the tricky part. I must find a way to "subset" the matrix of amplitude values s$amp using the coordinates of the longest contour con.max. What I aim to achieve is a new matrix containing only the amplitude values inside the polygon. The remaining parts of the spectrogram should then appear as blank spaces.
One approach I thought could work would be to create a loop that replaces every value outside the polygon with a given amplitude value (e.g. -25 dB). I once used a similar approach to remove the values below -30 dB and it worked out perfectly:
for(i in 1:length(s$amp)){if(s$amp[i] == -Inf |s$amp[i] <= -30)
{s$amp[i] <- -30}}
Another thought would be to create a new matrix with the same dimensions as s$amp, subset s$amp using the coordinates of the contour, then place the subset into the new matrix. Roughly:
mt <- matrix(-30, nrow=nrow(s$amp), ncol = ncol(s$amp))
sb <- s$amp[con.max$y, con.max$x]
new.mt <- c(mt, sb)
s$amp <- new.mt
I'll appreciate any help.
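One possible way to build such a mask is a point-in-polygon test applied to every (time, frequency) grid cell. The sketch below uses base R only and a toy grid and rectangle rather than the tico spectrogram; inside_polygon, time_axis, freq_axis, amp, vx and vy are all hypothetical names. It assumes, as in the contourLines call above, that the amplitude matrix has frequencies in rows and times in columns.

```r
# Ray-casting test: TRUE where point (px, py) lies inside polygon (vx, vy)
inside_polygon <- function(px, py, vx, vy) {
  n <- length(vx)
  inside <- rep(FALSE, length(px))
  j <- n
  for (i in seq_len(n)) {
    # Does a horizontal ray from the point cross the edge between vertices j and i?
    crosses <- ((vy[i] > py) != (vy[j] > py)) &
      (px < (vx[j] - vx[i]) * (py - vy[i]) / (vy[j] - vy[i]) + vx[i])
    inside <- xor(inside, crosses)
    j <- i
  }
  inside
}

# Toy stand-ins for s$time, s$freq and s$amp (rows = frequency, columns = time)
time_axis <- seq(0, 1, length.out = 11)
freq_axis <- seq(0, 10, length.out = 21)
amp <- matrix(-seq_len(length(freq_axis) * length(time_axis)),
              nrow = length(freq_axis))

# Toy polygon standing in for con.max$x (time) and con.max$y (frequency)
vx <- c(0.25, 0.75, 0.75, 0.25)
vy <- c(2.2, 2.2, 7.8, 7.8)

# Frequency varies fastest, matching the column-major layout of amp
grid <- expand.grid(freq = freq_axis, time = time_axis)
mask <- matrix(inside_polygon(grid$time, grid$freq, vx, vy),
               nrow = length(freq_axis))

# Blank out everything outside the polygon
masked <- amp
masked[!mask] <- NA
```

With the real data, the same call would take con.max$x and con.max$y as the polygon; using NA instead of a fixed dB value is a design choice that leaves the outside region blank in most plotting functions.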

Filling a curve with points that fit under the curve in R plot

I was wondering how I can efficiently (using short R code) fill a curve with points that can fill up the area under my curve?
I have tried something without success, here is my R code:
data = rnorm(1000) ## random data points to fill the curve
curve(dnorm(x), -4, 4) ## curve to be filled by "data" above
points(data) ## plotting the points to fill the curve
Here's a method that uses interpolation to ensure that the plotted points won't exceed the height of the curve (although, if you want the actual point markers to not stick out above the curve, you'll need to set the threshold slightly below the height of the curve):
# Curve to be filled
c.pts = as.data.frame(curve(dnorm(x), -4, 4))
# Generate 1000 random points in the same x-interval and with y value between
# zero and the maximum y-value of the curve
set.seed(2)
pts = data.frame(x=runif(1000,-4,4), y=runif(1000,0,max(c.pts$y)))
# Using interpolation, keep only those points whose y-value is less than y(x)
pts = pts[pts$y < approx(c.pts$x,c.pts$y,xout=pts$x)$y, ]
# Plot the points
points(pts, pch=16, col="red", cex=0.7)
A method for plotting exactly a desired number of points under a curve
Responding to @d.b's comment, here's a way to get exactly a desired number of points plotted under a curve:
First, let's figure out how many random points we need to generate over the entire plot region in order to get (roughly) a target number of points under the curve. We do this as follows:
Calculate the area under the curve as a fraction of the area of the rectangle bounded by zero and the maximum height of the curve on the vertical axis, and by the width of the curve on the horizontal axis.
The number of random points we need to generate is the target number of points, divided by the area ratio calculated above.
# Area ratio
aa = sum(c.pts$y*median(diff(c.pts$x)))/(diff(c(-4,4))*max(c.pts$y))
# Target number of points under curve
n.target = 1000
# Number of random points to generate
n = ceiling(n.target/aa)
But we need more points than this to ensure we get at least n.target, because random variation will result in fewer than n.target points about half the time, once we limit the plotted points to those below the curve. So we'll add an excess.factor in order to generate more points under the curve than we need, then we'll just randomly select n.target of those points to plot. Here's a function that takes care of the entire process for a general curve.
# Plot a specified number of points under a curve
pts.under.curve = function(data, n.target=1000, excess.factor=1.5) {
# Area under curve as fraction of area of plot region
aa = sum(data$y*median(diff(data$x)))/(diff(range(data$x))*max(data$y))
# Number of random points to generate
n = excess.factor*ceiling(n.target/aa)
# Generate n random points in x-range of the data and with y value between
# zero and the maximum y-value of the curve
pts = data.frame(x=runif(n,min(data$x),max(data$x)), y=runif(n,0,max(data$y)))
# Using interpolation, keep only those points whose y-value is less than y(x)
pts = pts[pts$y < approx(data$x,data$y,xout=pts$x)$y, ]
# Randomly select only n.target points
pts = pts[sample(1:nrow(pts), n.target), ]
# Plot the points
points(pts, pch=16, col="red", cex=0.7)
}
Let's run the function for the original curve:
c.pts = as.data.frame(curve(dnorm(x), -4, 4))
pts.under.curve(c.pts)
Now let's test it with a different distribution:
# Curve to be filled
c.pts = as.data.frame(curve(df(x, df1=100, df2=20),0,5,n=1001))
pts.under.curve(c.pts, n.target=200)
An alternative approach:
n_points = 10000 #A large number
#Store curve in a variable and plot
cc = curve(dnorm(x), -4, 4, n = n_points)
#Generate n_points random points
p = data.frame(x = seq(-4,4,length.out = n_points), y = rnorm(n = n_points))
#OR p = data.frame(x = runif(n_points,-4,4), y = rnorm(n = n_points))
#Find the index of the interval of cc$x that contains each p$x
p$ind = findInterval(p$x, cc$x)
#Only retain those points within the curve whose p$y are smaller than cc$y
p2 = p[p$y >= 0 & p$y < cc$y[p$ind],] #may need p[p$y < 0.90 * cc$y[p$ind],] or something
#Plot points
points(p2$x, p2$y)

R Surface Plot from List of X,Y,Z points

I am trying to make a surface plot for data that is in a very long list of x,y,z points. To do this, I am dividing the data into a grid of 10k squares and finding the max value of z within each square. From my understanding, each z value should be stored in a matrix where each element of the matrix corresponds to a square on the grid. Is there an easier way to do this than the code below? That last line is already pretty long and it is only one square.
x<-(sequence(101)-1)*max(eff$CFaR)/100
y<-(sequence(101)-1)*max(eff$EaR)/100
effmap<-matrix(ncol=length(x)-1, nrow=length(y)-1)
someMatrix <- max(eff$Cost[which(eff$EaR[which(eff$CFaR >= x[50] & eff$CFaR <x[51], arr.ind=TRUE)]>=y[20] & eff$EaR[which(eff$CFaR >= x[50] & eff$CFaR <x[51], arr.ind=TRUE)]< y[91])])
So this is my interpretation of what you are trying to accomplish...
df <- read.csv("effSample.csv") # downloaded from your link
df <- df[c("CFaR","EaR","Cost")] # remove unnecessary columns
df$x <- cut(df$CFaR,breaks=100,labels=FALSE) # establish bins: CFaR
df$y <- cut(df$EaR,breaks=100,labels=FALSE) # establish bins: EaR
df.max <- expand.grid(x=1:100,y=1:100) # template; 10,000 grid cells
# maximum cost in each grid cell - NOTE: most of the cells are *empty*
df.max <- merge(df.max,aggregate(Cost~x+y,df,max),all.x=TRUE)
z <- matrix(df.max$Cost,nr=100,nc=100) # Cost vector -> matrix
# colors based on z-value
palette <- rev(rainbow(20)) # palette of 20 colors
zlim <- range(z[!is.na(z)])
colors <- palette[19*(z-zlim[1])/diff(zlim) + 1]
# create the plot
library(rgl)
open3d(scale=c(1,1,10)) # CFaR and EaR range ~ 10 X Cost range
x.values <- min(df$CFaR)+(0:99)*diff(range(df$CFaR))/100
y.values <- min(df$EaR)+(0:99)*diff(range(df$EaR))/100
surface3d(x.values,y.values,z,col=colors)
axes3d()
title3d(xlab="CFaR",ylab="EaR",zlab="Cost")
The code above generates a rotatable 3D plot, so the image is just a screen shot. Notice how there are lots of "holes". This is (partially) because you provided only part of your data. However, it is important to realize that just because you imagine 10,000 grid cells (e.g., a 100 X 100 grid), does not mean that there will be data in every cell.
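The binning step can also be written with tapply, which returns the grid matrix directly and leaves empty cells as NA. This is a sketch on simulated points with a 10 X 10 grid; pts, bx, by and zmax are hypothetical names, and the real data would use CFaR, EaR and Cost in place of x, y and z.

```r
# Simulated stand-in for the (CFaR, EaR, Cost) points
set.seed(1)
pts <- data.frame(x = runif(500), y = runif(500), z = rnorm(500))

n_bins <- 10
# Bin index of each point along each axis
bx <- cut(pts$x, breaks = n_bins, labels = FALSE)
by <- cut(pts$y, breaks = n_bins, labels = FALSE)

# Maximum z per cell; factors with all levels keep empty cells as NA
zmax <- tapply(pts$z,
               list(factor(bx, levels = 1:n_bins),
                    factor(by, levels = 1:n_bins)),
               max)
dim(zmax)
```

Rows of zmax follow the x bins and columns follow the y bins, which is the orientation that functions such as persp expect for z.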

volume from the surface down to a contour in R

I need to find the volume from the surface down to a specific contour in R. Taking the contour example from the R help files:
x <- 10*1:nrow(volcano)
y <- 10*1:ncol(volcano)
contour(x,y,volcano)
Given the resulting graph, how do I find the volume from a specific contour line up to the surface?
In practice, I will use bkde2D to get a density map for a scatter plot. From this I can make the contour plot, but I would like to determine the volume defined by various density cutoffs in the resulting plot.
The function contour only draws the contour lines and doesn't return any values. What you need to use is the function contourLines.
cL <- contourLines(x,y,volcano)
From there, you can calculate the area enclosed by each contour line the following way:
area <- rep(0,length(cL))
for(i in 1:length(cL)){
d <- data.frame(cL[[i]]$x,cL[[i]]$y)
sa <- sb <- 0
for(j in 1:(nrow(d)-1)){
sa <- sa+d[j,1]*d[j+1,2]
sb <- sb+d[j,2]*d[j+1,1]
}
area[i] <- abs((sa-sb)/2)
}
area
[1] 1.413924e+05 3.109685e+04 2.431528e+04 2.049473e+04 6.705976e+04 3.202145e+05 1.720469e+03
[8] 2.926802e+05 2.335421e+05 1.834791e+05 1.326162e+05 4.672784e+02 9.419792e+04 5.121851e+03
[15] 5.126860e+04 3.660862e-01 1.216750e+03 2.051307e+04 4.670745e+02 4.146927e+03
Now, if you want the volume between two contour lines (say between levels 120 and 130):
level1 <- 120
level2 <- 130
levels <- unlist(lapply(cL,function(x)x$level))
base <- (1:length(cL))[levels==level1]
top <- (1:length(cL))[levels==level2]
vol <- (levels[top]-levels[base])*(area[base]+area[top])/2
vol
[1] 2631111
And that's as far as I can go because I don't see how to proceed if the next contour line is split into several sectors.
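A different route that sidesteps split contours is to skip the contour polygons altogether and sum the height above the cutoff over the grid cells. This is a plain Riemann-sum approximation, not the contour-based method above, and volume_above is a hypothetical helper name. It assumes the 10 X 10 cell size implied by the x and y definitions in the question.

```r
# Hypothetical helper: approximate volume between the surface and a cutoff level
volume_above <- function(z, level, cell_area) {
  # Height above the cutoff in each cell (zero where the surface is below it)
  sum(pmax(z - level, 0)) * cell_area
}

cell_area <- 10 * 10   # grid spacing of 10 in x and y, as in the question
vol_120 <- volume_above(volcano, 120, cell_area)
vol_130 <- volume_above(volcano, 130, cell_area)

# Volume of the slice that lies between the two levels
vol_120 - vol_130
```

The same function should apply to the fhat matrix returned by bkde2D, with cell_area taken from the spacing of its x1 and x2 grids.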
